[00:14] *** asdf has joined #urlteam
[00:15] *** asdf has quit IRC (Remote host closed the connection)
[00:40] *** JesseW has joined #urlteam
[00:41] *** svchfoo1 sets mode: +o JesseW
[00:49] phuzion: letter looks good. At the end, after the incomplete sentence, I'd add "providing us a separate dump."
[00:50] regarding a description of ArchiveTeam, we could use the sentence on the front page: "a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage."
[01:44] *** asdf has joined #urlteam
[02:50] *** VADemon has quit IRC (left4dead)
[03:17] migre-me seems to be having some trouble right now -- paused grab
[04:13] *** logchfoo starts logging #urlteam at Mon Dec 21 04:13:30 2015
[04:13] *** logchfoo has joined #urlteam
[04:54] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[04:55] *** dashcloud has joined #urlteam
[04:55] *** svchfoo1 sets mode: +o dashcloud
[05:12] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:19] *** dashcloud has joined #urlteam
[05:19] *** svchfoo1 sets mode: +o dashcloud
[05:37] JesseW: around?
[05:38] yep
[05:38] * JesseW arises from the depths
[05:38] Can you take one final look at that email?
[05:41] change "If you could contact Archive.org and request that your dumps be made publicly available" to "If you contact Archive.org and they fix it so your dumps are publicly available"
[05:42] It's not good enough for them to *ask* -- the dumps need to actually *be* downloadable (and, for that matter, downloaded) before we won't need to scrape.
[05:42] True.
[05:42] That change is made. Good to send?
[05:43] "again. #urlteam" -> "again. We're available in #urlteam"
[05:43] Otherwise it looks great! Thanks for writing it up.
[05:45] Sent.
[05:45] :-)
[05:47] I'm gonna see if there's anything else I can do on that list of still-alive shorteners on the wiki page
[05:47] yayayay -- thank you!
[05:47] No problem.
[05:49] If you want to write another letter -- there are about half a dozen 301works archives that should be made public, because the shortener has died -- I've been meaning to send a note in to info@archive, but haven't gotten around to it. Check out the Discontinued list.
[05:49] Eh, I really dislike writing letters, lol, I only wrote that one because they reached out to us directly.
[05:49] makes sense
[05:50] I'll get around to it eventually. I sort of don't want to bother Jeff K (the person who answers all the email) right after the telethon, which is one reason I'm waiting on it.
[05:50] totally understood.
[05:51] Total shorturl space is calculated by (number in alphabet)^(length of shorturl), right?
[05:51] number of characters in alphabet*
[05:51] assuming ^ is "to the power of", not bitwise XOR, yep
[05:52] ok, wanted to make sure I was doing my math right
[05:52] hm -- I'm going to add a table of common values for that to the page now
[05:57] JesseW: Another shortener researched and ready to drop into the tracker: spne.ws. A-Z a-z 0-9, should only take a day or so, it's 3 characters or less from what I've seen.
[05:58] cool, adding it now!
[05:58] Only problem is that valid and invalid responses are both 301 responses.
[05:58] we can handle that now, assuming there's a regex for the invalid case
[05:58] Should be.
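Backing up to the [05:51] exchange: the total-keyspace arithmetic is simple enough to sanity-check in a few lines of Python. A minimal sketch; the formula is the one stated in the log, while the alphabet sizes listed are assumptions drawn from the shorteners discussed here:

```python
# Keyspace arithmetic from the [05:51] exchange:
# total shorturl space = (characters in alphabet) ** (length of shorturl).
# The alphabet sizes below are assumptions matching shorteners in this log.
ALPHABET_SIZES = {"a-z 0-9": 36, "A-Z a-z 0-9": 62}

for name, size in ALPHABET_SIZES.items():
    for length in (1, 2, 3, 4):
        print(f"{name} ^ {length} = {size ** length:,}")

# Codes of length *up to* n are a sum of powers; for spne.ws
# ("3 characters or less", 62-character alphabet):
print(sum(62 ** k for k in range(1, 4)))  # 62 + 3,844 + 238,328 = 242,234
```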
[05:59] looks like this regex would work: awesm=spne\.ws
[05:59] Cool
[05:59] nope
[06:00] af gives http://www.siliconprairienews.com/articles/4991?utm_source=direct-spne.ws&awesm=spne.ws_af&utm_medium=spne.ws-other&utm_content=api&utm_campaign=
[06:00] Oh. Lame.
[06:00] yeah, still investigating
[06:01] The full URL I get when throwing garbage at it is: http://www.siliconprairienews.com/?awesm=spne.ws_1&utm_medium=spne.ws-root&utm_source=direct-spne.ws&utm_content=root
[06:01] hm, I think I'll try www\.siliconprairienews\.com/\?
[06:02] maybe with awesm= appended
[06:02] Nah, it's a vanity URL shortener; they're going to use their own links on there a lot.
[06:02] right, but all their links don't have a question mark right after the domain name
[06:02] (all their links that aren't to their homepage)
[06:02] Ohhhhhhhhhh
[06:02] * phuzion headdesk
[06:03] is it uppercase and lowercase?
[06:03] Yep. A-Z a-z 0-9
[06:04] cool
[06:04] started
[06:05] 6 of the first 10 jobs went to Atluxity :-)
[06:09] sadly, it looks like a lot of the early URLs 404 on their site now.
[06:22] JesseW: I'm gonna just kick off an archivebot archive of that whole TLD, sound good?
[06:22] what, dot ws?
[06:22] er
[06:22] not the tld
[06:22] but the domain
[06:22] http://www.siliconprairienews.com/
[06:22] How does one even *do* an archivebot of a whole TLD?
[06:23] Ahh, that makes more sense.
[06:23] lol, the same way we scrape an entire url shortener
[06:23] I bet a lot of people would be like "how the fuck do you even back up all of bit.ly or tinyurl?"
[06:24] archivebot isn't set up to do bruteforce searches, AFAIK.
[06:25] and most domain names are longer than 4 characters. :-)
[06:25] but sure, if someone wanted to code up an appropriate bot, I suppose it would be equally possible to identify all the 2nd-level domains that way.
[06:25] Yeah, I know, I was just messing
[06:26] IIRC, someone wanted to do that for .no -- because the registrar was claiming the list was secret.
[06:27] Think my ISP would be pissed if I bruteforced the .no domain?
[06:27] I don't know your ISP. Mine probably wouldn't mind. :-)
[06:28] Pssh, my nameservers are 8.8.8.8, 8.8.4.4 and 4.2.2.2. Google and Level3 probably don't care.
[06:28] Probably not. :-)
[06:28] Feel free to look through the logs and contact whoever it was who wanted that.
[06:30] * phuzion wonders what the fastest method to do this would be...
[06:32] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:35] *** dashcloud has joined #urlteam
[06:35] *** svchfoo3 sets mode: +o dashcloud
[06:41] http://archiveteam.org/index.php?title=URLTeam#Common_numbers <- made the table I mentioned; feel free to add other ones
[06:42] I didn't think 2 characters was particularly useful, as it's so small we'd mostly just start from/end at 0 or 3 characters.
[06:51] *** Coderjoe has quit IRC (Read error: Operation timed out)
[06:58] *** Coderjoe has joined #urlteam
[07:46] Got a response from the qr.cx operator: he just gave us a dump, licensed under CC-BY
[07:47] https://gist.github.com/phuzion/a29943a7979a2e5f3aa2
[07:51] awesome!
[07:52] So, the file is a CSV; here's a sample line: "http://qr.cx/gCD" "http://flo.cx" "2009-06-16 00:00:00"
[07:52] That should be pretty damn easy to import.
[07:53] I'm not sure whether it's best to dump http://qr.cx/dataset/qrcx_all_06eec9b9-1f29-4860-bd91-49c2d517d87d.7z in archivebot, or upload it directly to IA and just tag it with urlteam, or both, or something else...
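The regex the channel settled on for the spne.ws invalid case can be sanity-checked against the two redirect URLs quoted in the [06:00]-[06:01] exchange. A minimal sketch; the regex and both URLs are verbatim from the log, the surrounding harness is assumed:

```python
import re

# Invalid-case regex from [06:01]: an unknown spne.ws code redirects to the
# siliconprairienews.com homepage, whose URL has a "?" right after the domain.
# Real article links have a path there instead, so they don't match. (The
# first attempt, awesm=spne\.ws, was rejected because it matches both URLs.)
INVALID = re.compile(r"www\.siliconprairienews\.com/\?")

valid_hit = ("http://www.siliconprairienews.com/articles/4991"
             "?utm_source=direct-spne.ws&awesm=spne.ws_af"
             "&utm_medium=spne.ws-other&utm_content=api&utm_campaign=")
garbage_hit = ("http://www.siliconprairienews.com/?awesm=spne.ws_1"
               "&utm_medium=spne.ws-root&utm_source=direct-spne.ws"
               "&utm_content=root")

assert INVALID.search(valid_hit) is None        # real shortcode: keep
assert INVALID.search(garbage_hit) is not None  # homepage redirect: discard
```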
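Note that the qr.cx sample line above is double-quoted, space-separated fields rather than a comma-separated CSV, which Python's csv module handles with a custom delimiter. A minimal parsing sketch; only the sample line itself is from the log, and the field interpretation (short URL, target, timestamp) is inferred from it:

```python
import csv
import io

# Sample line quoted verbatim from the log. Fields are double-quoted and
# separated by spaces, so csv.reader gets delimiter=" " instead of the
# default comma; the quoting keeps the space inside the timestamp intact.
sample = '"http://qr.cx/gCD" "http://flo.cx" "2009-06-16 00:00:00"\n'

for short_url, target, created in csv.reader(io.StringIO(sample), delimiter=" "):
    shortcode = short_url.rsplit("/", 1)[-1]  # "gCD"
    print(shortcode, target, created)
```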
[07:53] please update the warrior job entry in any case
[07:55] Uhhhg, wikitables. If someone doesn't mess with it by tomorrow morning, I'll wrap my head around the wikitable and do it
[07:57] Just type what you want added to the comment, and I'll paste it in
[07:58] (I'd be open to switching to a different table format, too, if you have a suggestion)
[07:59] No, wikitables are fine; they're native to MediaWiki, and they work. It just takes a bit of mental processing for me to wrap my head around them when I need to work with them.
[07:59] And it's like 3am in my timezone
[07:59] basically, I'd just move the wikitable thing to the "alive" section and say "We're not going to archive this for a bit because we have a full dump." or something
[08:02] ok. and -- go to sleep, it's 3am in your timezone. :-)
[08:02] (or not, as you wish)
[08:40] migre-me seems to be having network problems that are causing it to take more than a minute to get responses back, preventing the scraper from working. Please remind me to try turning the scraper back on in a few days.
[08:45] spne-ws done for now, thanks phuzion
[09:07] *** JesseW has quit IRC (Leaving.)
[09:17] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:31] *** dashcloud has joined #urlteam
[09:31] *** svchfoo1 sets mode: +o dashcloud
[09:36] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:42] *** dashcloud has joined #urlteam
[09:42] *** svchfoo3 sets mode: +o dashcloud
[13:32] *** WinterFox has quit IRC (Remote host closed the connection)
[15:39] *** VADemon has joined #urlteam
[15:42] *** asdf has quit IRC (Ping timeout: 252 seconds)
[16:40] *** dashcloud has quit IRC (Ping timeout: 250 seconds)
[16:49] *** dashcloud has joined #urlteam
[16:50] *** svchfoo1 sets mode: +o dashcloud
[17:15] *** JesseW has joined #urlteam
[17:16] *** svchfoo1 sets mode: +o JesseW
[17:36] at.cmt.com is an ow.ly domain, it looks like.
[17:38] *** JesseW has quit IRC (Leaving.)
[17:54] *** VADemon has quit IRC (Read error: Operation timed out)
[19:24] poeurl.com should be a quick scrape. 3 characters long
[19:45] cool, will do when I'm home
[20:06] JW_work: Out of curiosity, how much of a hassle would it be for me to get an account on the tracker so I can do little ones like that from time to time?
[20:21] No trouble at all. In fact, I'm delighted to get more hands involved. I'll make you one ASAP; /msg me your desired username (presumably phuzion) and initial password.
[20:22] Is there a method to change passwords?
[20:23] PM'd a username/password to you
[20:24] *** JW_work1 has joined #urlteam
[20:26] *** JW_work has quit IRC (Ping timeout: 255 seconds)
[20:33] *** JW_work1 has quit IRC (Ping timeout: 260 seconds)
[20:42] *** JW_work has joined #urlteam
[22:17] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[22:17] *** dashcloud has joined #urlteam
[22:18] *** svchfoo1 sets mode: +o dashcloud
[22:26] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:30] *** dashcloud has joined #urlteam
[22:30] *** svchfoo3 sets mode: +o dashcloud
[23:36] *** WinterFox has joined #urlteam
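On the [08:40] migre-me pause: one low-effort way to decide when to turn the scraper back on is to time a single probe request against the one-minute threshold mentioned in the log. A hedged sketch only; the probe URL and the resume heuristic are assumptions, not how the tracker or scraper actually works:

```python
import time
import urllib.request

PROBE_URL = "http://migre.me/1"  # assumed probe shortcode, not from the log
TIMEOUT_SECS = 60                # "more than a minute" per the [08:40] note

def looks_healthy(url: str = PROBE_URL) -> bool:
    """Fetch one shortcode; report healthy only if it answers well under
    the timeout that was stalling the scraper."""
    start = time.monotonic()
    try:
        urllib.request.urlopen(url, timeout=TIMEOUT_SECS)
    except Exception:  # timeout, DNS failure, HTTP error, ...
        return False
    return time.monotonic() - start < TIMEOUT_SECS / 2

if __name__ == "__main__":
    print("worth resuming the migre-me grab?", looks_healthy())
```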