[00:14] *** asdf has joined #urlteam
[00:15] *** asdf has quit IRC (Remote host closed the connection)
[00:40] *** JesseW has joined #urlteam
[00:41] *** svchfoo1 sets mode: +o JesseW
[00:49] phuzion: letter looks good. At the end, after the incomplete sentence, I'd add "providing us a separate dump."
[00:50] regarding a description of ArchiveTeam, we could use the sentence on the front page: "a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage."
[01:44] *** asdf has joined #urlteam
[02:50] *** VADemon has quit IRC (left4dead)
[03:17] migre-me seems to be having some trouble right now -- paused grab
[04:13] *** logchfoo starts logging #urlteam at Mon Dec 21 04:13:30 2015
[04:13] *** logchfoo has joined #urlteam
[04:54] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[04:55] *** dashcloud has joined #urlteam
[04:55] *** svchfoo1 sets mode: +o dashcloud
[05:12] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:19] *** dashcloud has joined #urlteam
[05:19] *** svchfoo1 sets mode: +o dashcloud
[05:37] JesseW: around?
[05:38] yep
[05:38] * JesseW arises from the depths
[05:38] Can you take one final look at that email?
[05:41] change "If you could contact Archive.org and request that your dumps be made publicly available" to "If you contact Archive.org and they fix it so your dumps are publicly available"
[05:42] It's not good enough for them to *ask* -- the dumps need to actually *be* downloadable (and, for that matter, downloaded) before we won't need to scrape.
[05:42] True.
[05:42] That change is made. Good to send?
[05:43] "again. #urlteam" -> "again. We're available in #urlteam"
[05:43] Otherwise it looks great! Thanks for writing it up.
[05:45] Sent.
[05:45] :-)
[05:47] I'm gonna see if there's anything else I can do on that list of still-alive shorteners on the wiki page
[05:47] yayayay -- thank you!
[05:47] No problem.
[05:49] If you want to write another letter -- there are about half a dozen 301works archives that should be made public, because the shortener has died -- I've been meaning to send a note in to info@archive, but haven't gotten around to it. Check out the Discontinued list.
[05:49] Eh, I really dislike writing letters, lol, I only wrote that one because they reached out to us directly.
[05:49] makes sense
[05:50] I'll get around to it eventually. I sort of don't want to bother Jeff K (the person who answers all the email) right after the telethon, which is one reason I'm waiting on it.
[05:50] totally understood.
[05:51] Total shorturl space is calculated by (number in alphabet)^(length of shorturl), right?
[05:51] number of characters in alphabet*
[05:51] assuming ^ is "to the power of", not bitwise XOR, yep
[05:52] ok, wanted to make sure I was doing my math right
[05:52] hm -- I'm going to add a table of common values for that to the page now
[05:57] JesseW: Another shortener researched and ready to drop into the tracker: spne.ws. A-Z a-z 0-9, should only take a day or so, it's 3 characters or less from what I've seen.
[05:58] cool, adding it now!
[05:58] Only problem is that valid and invalid responses are both 301 responses.
[05:58] we can handle that now, assuming there's a regex for the invalid case
[05:58] Should be.
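Backing up to the [05:51] exchange: the total-keyspace arithmetic is simple enough to sanity-check in a few lines of Python. A minimal sketch; the formula is the one stated in the log, while the alphabet sizes listed are assumptions drawn from the shorteners discussed here:

```python
# Keyspace arithmetic from the [05:51] exchange:
# total shorturl space = (characters in alphabet) ** (length of shorturl).
# The alphabet sizes below are assumptions matching shorteners in this log.
ALPHABET_SIZES = {"a-z 0-9": 36, "A-Z a-z 0-9": 62}

for name, size in ALPHABET_SIZES.items():
    for length in (1, 2, 3, 4):
        print(f"{name} ^ {length} = {size ** length:,}")

# Codes of length *up to* n are a sum of powers; for spne.ws
# ("3 characters or less", 62-character alphabet):
print(sum(62 ** k for k in range(1, 4)))  # 62 + 3,844 + 238,328 = 242,234
```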
[05:59] looks like this regex would work: awesm=spne\.ws
[05:59] Cool
[05:59] nope
[06:00] af gives http://www.siliconprairienews.com/articles/4991?utm_source=direct-spne.ws&awesm=spne.ws_af&utm_medium=spne.ws-other&utm_content=api&utm_campaign=
[06:00] Oh. Lame.
[06:00] yeah, still investigating
[06:01] The full URL I get when throwing garbage at it is: http://www.siliconprairienews.com/?awesm=spne.ws_1&utm_medium=spne.ws-root&utm_source=direct-spne.ws&utm_content=root
[06:01] hm, I think I'll try www\.siliconprairienews\.com/\?
[06:02] maybe with awesm= appended
[06:02] Nah, it's a vanity URL shortener; they're going to use their own links on there a lot.
[06:02] right, but all their links don't have a question mark right after the domain name
[06:02] (all their links that aren't to their homepage)
[06:02] Ohhhhhhhhhh
[06:02] * phuzion headdesk
[06:03] is it uppercase and lowercase?
[06:03] Yep. A-Z a-z 0-9
[06:04] cool
[06:04] started
[06:05] 6 of the first 10 jobs went to Atluxity :-)
[06:09] sadly, it looks like a lot of the early URLs 404 on their site now.
[06:22] JesseW: I'm gonna just kick off an archivebot archive of that whole TLD, sound good?
[06:22] what, dot ws?
[06:22] er
[06:22] not the tld
[06:22] but the domain
[06:22] http://www.siliconprairienews.com/
[06:22] How does one even *do* an archivebot of a whole TLD?
[06:23] Ahh, that makes more sense.
[06:23] lol, the same way we scrape an entire url shortener
[06:23] I bet a lot of people would be like "how the fuck do you even back up all of bit.ly or tinyurl?"
[06:24] archivebot isn't set up to do bruteforce searches, AFAIK.
[06:25] and most domain names are longer than 4 characters. :-)
[06:25] but sure, if someone wanted to code up an appropriate bot, I suppose it would be equally possible to identify all the 2nd-level domains that way.
[06:25] Yeah, I know, I was just messing
[06:26] IIRC, someone wanted to do that for .no -- because the registrar was claiming the list was secret.
[06:27] Think my ISP would be pissed if I bruteforced the .no domain?
[06:27] I don't know your ISP. Mine probably wouldn't mind. :-)
[06:28] Pssh, my nameservers are 8.8.8.8, 8.8.4.4 and 4.2.2.2. Google and Level3 probably don't care.
[06:28] Probably not. :-)
[06:28] Feel free to look through the logs and contact whoever it was who wanted that.
[06:30] * phuzion wonders what the fastest method to do this would be...
[06:32] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:35] *** dashcloud has joined #urlteam
[06:35] *** svchfoo3 sets mode: +o dashcloud
[06:41] http://archiveteam.org/index.php?title=URLTeam#Common_numbers <- made the table I mentioned; feel free to add other ones
[06:42] I didn't think 2 characters was particularly useful, as it's so small we'd mostly just start from/end at 0 or 3 characters.
[06:51] *** Coderjoe has quit IRC (Read error: Operation timed out)
[06:58] *** Coderjoe has joined #urlteam
[07:46] Got a response from the qr.cx operator: he just gave us a dump, licensed under CC-BY
[07:47] https://gist.github.com/phuzion/a29943a7979a2e5f3aa2
[07:51] awesome!
[07:52] So, the file is a CSV; here's a sample line: "http://qr.cx/gCD" "http://flo.cx" "2009-06-16 00:00:00"
[07:52] That should be pretty damn easy to import.
[07:53] I'm not sure whether it's best to dump http://qr.cx/dataset/qrcx_all_06eec9b9-1f29-4860-bd91-49c2d517d87d.7z in archivebot, or upload it directly to IA and just tag it with urlteam, or both, or something else...
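The regex the channel settled on for the spne.ws invalid case can be sanity-checked against the two redirect URLs quoted in the [06:00]-[06:01] exchange. A minimal sketch; the regex and both URLs are verbatim from the log, the surrounding harness is assumed:

```python
import re

# Invalid-case regex from [06:01]: an unknown spne.ws code redirects to the
# siliconprairienews.com homepage, whose URL has a "?" right after the domain.
# Real article links have a path there instead, so they don't match. (The
# first attempt, awesm=spne\.ws, was rejected because it matches both URLs.)
INVALID = re.compile(r"www\.siliconprairienews\.com/\?")

valid_hit = ("http://www.siliconprairienews.com/articles/4991"
             "?utm_source=direct-spne.ws&awesm=spne.ws_af"
             "&utm_medium=spne.ws-other&utm_content=api&utm_campaign=")
garbage_hit = ("http://www.siliconprairienews.com/?awesm=spne.ws_1"
               "&utm_medium=spne.ws-root&utm_source=direct-spne.ws"
               "&utm_content=root")

assert INVALID.search(valid_hit) is None        # real shortcode: keep
assert INVALID.search(garbage_hit) is not None  # homepage redirect: discard
```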
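Note that the qr.cx sample line above is double-quoted, space-separated fields rather than a comma-separated CSV, which Python's csv module handles with a custom delimiter. A minimal parsing sketch; only the sample line itself is from the log, and the field interpretation (short URL, target, timestamp) is inferred from it:

```python
import csv
import io

# Sample line quoted verbatim from the log. Fields are double-quoted and
# separated by spaces, so csv.reader gets delimiter=" " instead of the
# default comma; the quoting keeps the space inside the timestamp intact.
sample = '"http://qr.cx/gCD" "http://flo.cx" "2009-06-16 00:00:00"\n'

for short_url, target, created in csv.reader(io.StringIO(sample), delimiter=" "):
    shortcode = short_url.rsplit("/", 1)[-1]  # "gCD"
    print(shortcode, target, created)
```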
[07:53] please update the warrior job entry in any case
[07:55] Uhhhg, wikitables. If someone doesn't mess with it by tomorrow morning, I'll wrap my head around the wikitable and do it
[07:57] Just type what you want added to the comment, and I'll paste it in
[07:58] (I'd be open to switching to a different table format, too, if you have a suggestion)
[07:59] No, wikitables are fine; they're native to MediaWiki, and they work. It just takes a bit of mental processing for me to wrap my head around them when I need to work with them.
[07:59] And it's like 3am in my timezone
[07:59] basically, I'd just move the wikitable thing to the "alive" section and say "We're not going to archive this for a bit because we have a full dump." or something
[08:02] ok. and -- go to sleep, it's 3am in your timezone. :-)
[08:02] (or not, as you wish)
[08:40] migre-me seems to be having network problems that are causing it to take more than a minute to get responses back, preventing the scraper from working. Please remind me to try turning the scraper back on in a few days.
[08:45] spne-ws done for now, thanks phuzion
[09:07] *** JesseW has quit IRC (Leaving.)
[09:17] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:31] *** dashcloud has joined #urlteam
[09:31] *** svchfoo1 sets mode: +o dashcloud
[09:36] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:42] *** dashcloud has joined #urlteam
[09:42] *** svchfoo3 sets mode: +o dashcloud
[13:32] *** WinterFox has quit IRC (Remote host closed the connection)
[15:39] *** VADemon has joined #urlteam
[15:42] *** asdf has quit IRC (Ping timeout: 252 seconds)
[16:40] *** dashcloud has quit IRC (Ping timeout: 250 seconds)
[16:49] *** dashcloud has joined #urlteam
[16:50] *** svchfoo1 sets mode: +o dashcloud
[17:15] *** JesseW has joined #urlteam
[17:16] *** svchfoo1 sets mode: +o JesseW
[17:36] at.cmt.com is an ow.ly domain, it looks like.
[17:38] *** JesseW has quit IRC (Leaving.)
[17:54] *** VADemon has quit IRC (Read error: Operation timed out)
[19:24] poeurl.com should be a quick scrape. 3 characters long
[19:45] cool, will do when I'm home
[20:06] JW_work: Out of curiosity, how much of a hassle would it be for me to get an account on the tracker so I can do little ones like that from time to time?
[20:21] No trouble at all. In fact, I'm delighted to get more hands involved. I'll make you one ASAP; /msg me your desired username (presumably phuzion) and initial password.
[20:22] Is there a method to change passwords?
[20:23] PM'd a username/password to you
[20:24] *** JW_work1 has joined #urlteam
[20:26] *** JW_work has quit IRC (Ping timeout: 255 seconds)
[20:33] *** JW_work1 has quit IRC (Ping timeout: 260 seconds)
[20:42] *** JW_work has joined #urlteam
[22:17] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[22:17] *** dashcloud has joined #urlteam
[22:18] *** svchfoo1 sets mode: +o dashcloud
[22:26] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:30] *** dashcloud has joined #urlteam
[22:30] *** svchfoo3 sets mode: +o dashcloud
[23:36] *** WinterFox has joined #urlteam
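On the [08:40] migre-me pause: one low-effort way to decide when to turn the scraper back on is to time a single probe request against the one-minute threshold mentioned in the log. A hedged sketch only; the probe URL and the resume heuristic are assumptions, not how the tracker or scraper actually works:

```python
import time
import urllib.request

PROBE_URL = "http://migre.me/1"  # assumed probe shortcode, not from the log
TIMEOUT_SECS = 60                # "more than a minute" per the [08:40] note

def looks_healthy(url: str = PROBE_URL) -> bool:
    """Fetch one shortcode; report healthy only if it answers well under
    the timeout that was stalling the scraper."""
    start = time.monotonic()
    try:
        urllib.request.urlopen(url, timeout=TIMEOUT_SECS)
    except Exception:  # timeout, DNS failure, HTTP error, ...
        return False
    return time.monotonic() - start < TIMEOUT_SECS / 2

if __name__ == "__main__":
    print("worth resuming the migre-me grab?", looks_healthy())
```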