#urlteam 2015-12-21,Mon

Time Nickname Message
00:14 🔗 asdf has joined #urlteam
00:15 🔗 asdf has quit IRC (Remote host closed the connection)
00:40 🔗 JesseW has joined #urlteam
00:41 🔗 svchfoo1 sets mode: +o JesseW
00:49 🔗 JesseW phuzion: letter looks good. At the end, after the incomplete sentence, I'd add "providing us a separate dump."
00:50 🔗 JesseW regarding a description of ArchiveTeam, we could use the sentence on the front page: "a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage."
01:44 🔗 asdf has joined #urlteam
02:50 🔗 VADemon has quit IRC (left4dead)
03:17 🔗 JesseW migre-me seems to be having some trouble right now -- paused grab
04:13 🔗 logchfoo starts logging #urlteam at Mon Dec 21 04:13:30 2015
04:13 🔗 logchfoo has joined #urlteam
04:54 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
04:55 🔗 dashcloud has joined #urlteam
04:55 🔗 svchfoo1 sets mode: +o dashcloud
05:12 🔗 dashcloud has quit IRC (Read error: Operation timed out)
05:19 🔗 dashcloud has joined #urlteam
05:19 🔗 svchfoo1 sets mode: +o dashcloud
05:37 🔗 phuzion JesseW: around?
05:38 🔗 JesseW yep
05:38 🔗 * JesseW arises from the depths
05:38 🔗 phuzion Can you take one final look at that email?
05:41 🔗 JesseW change "If you could contact Archive.org and request that your dumps be made publicly available" to "If you contact Archive.org and they fix it so your dumps are publicly available"
05:42 🔗 JesseW It's not good enough for them to *ask* -- the dumps need to actually *be* downloadable (and, for that matter downloaded) before we won't need to scrape.
05:42 🔗 phuzion True.
05:42 🔗 phuzion That change is made. Good to send?
05:43 🔗 JesseW "again. #urlteam" -> "again. We're available in #urlteam"
05:43 🔗 JesseW Otherwise it looks great! Thanks for writing it up.
05:45 🔗 phuzion Sent.
05:45 🔗 JesseW :-)
05:47 🔗 phuzion I'm gonna see if there's anything else I can do on that list of still-alive shorteners on the wiki page
05:47 🔗 JesseW yayayay -- thank you!
05:47 🔗 phuzion No problem.
05:49 🔗 JesseW If you want to write another letter -- there are about half a dozen 301works archives that should be made public, because the shortener has died -- I've been meaning to send a note in to info@archive, but haven't gotten around to it. Check out the Discontinued list.
05:49 🔗 phuzion Eh, I really dislike writing letters, lol, I only wrote that one because they reached out to us directly.
05:49 🔗 JesseW makes sense
05:50 🔗 JesseW I'll get around to it eventually. I sort of don't want to bother Jeff K (the person who answers all the email) right after the telethon, which is one reason I'm waiting on it.
05:50 🔗 phuzion totally understood.
05:51 🔗 phuzion Total shorturl space is calculated by (number in alphabet)^(length of shorturl) right?
05:51 🔗 phuzion number of characters in alphabet*
05:51 🔗 JesseW assuming ^ is "to the power of", not bitwise XOR, yep
05:52 🔗 phuzion ok, wanted to make sure I was doing my math right
05:52 🔗 JesseW hm -- I'm going to add a table of common values for that to the page, now
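
The calculation confirmed above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this note, and the 62-character alphabet (A-Z a-z 0-9) is an assumed example, not a property of any particular shortener. Note that in Python `**` is "to the power of", while `^` really is bitwise XOR.

```python
# Illustrative sketch: size of a shortener's code space.
# Function names are hypothetical; the alphabet is an assumed example.
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits  # 62 chars


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct codes of exactly `length` characters."""
    return alphabet_size ** length  # ** is exponentiation; ^ would be XOR


def keyspace_up_to(alphabet_size: int, max_length: int) -> int:
    """Number of distinct codes of 1..max_length characters."""
    return sum(keyspace(alphabet_size, n) for n in range(1, max_length + 1))


if __name__ == "__main__":
    size = len(ALPHABET)
    for n in range(1, 5):
        print(f"length {n}: {keyspace(size, n)} codes, "
              f"{keyspace_up_to(size, n)} of length <= {n}")
```

For a shortener whose codes are "N characters or less", the second function is the relevant one, since shorter codes also have to be tried.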
05:57 🔗 phuzion JesseW: Another shortener researched and ready to drop into the tracker: spne.ws. A-Z a-z 0-9, should only take a day or so, it's 3 characters or less from what I've seen.
05:58 🔗 JesseW cool, adding it now!
05:58 🔗 phuzion Only problem is that valid and invalid responses are both 301 responses.
05:58 🔗 JesseW we can handle that now, assuming there's a regex for the invalid case
05:58 🔗 phuzion Should be.
05:59 🔗 JesseW looks like this regex would work: awesm=spne\.ws
05:59 🔗 phuzion Cool
05:59 🔗 JesseW nope
06:00 🔗 JesseW af gives http://www.siliconprairienews.com/articles/4991?utm_source=direct-spne.ws&awesm=spne.ws_af&utm_medium=spne.ws-other&utm_content=api&utm_campaign=
06:00 🔗 phuzion Oh. Lame.
06:00 🔗 JesseW yeah, still investigating
06:01 🔗 phuzion The full URL I get when throwing garbage at it is: http://www.siliconprairienews.com/?awesm=spne.ws_1&utm_medium=spne.ws-root&utm_source=direct-spne.ws&utm_content=root
06:01 🔗 JesseW hm, I think I'll try www\.siliconprairienews\.com/\?
06:02 🔗 JesseW maybe with awesm= appended
06:02 🔗 phuzion Nah, it's a vanity URL shortener, they're going to use their own links on there a lot.
06:02 🔗 JesseW right, but all their links don't have a question mark right after the domain name
06:02 🔗 JesseW (all their links that aren't to their homepage)
06:02 🔗 phuzion Ohhhhhhhhhh
06:02 🔗 * phuzion headdesk
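
The difference between the two candidate regexes can be checked against the two redirect targets pasted above. This is a rough sketch, assuming the pattern is applied as an unanchored search against the redirect target; how the tracker actually applies it is not stated in the log, and the variable and function names are made up for this note.

```python
# Illustrative check of the two candidate "invalid response" patterns.
# Assumption: the pattern is searched (not full-matched) in the redirect target.
import re

FIRST_TRY = re.compile(r"awesm=spne\.ws")
HOMEPAGE_REDIRECT = re.compile(r"www\.siliconprairienews\.com/\?")

# Redirect target for the valid code "af", as pasted in the channel.
VALID_TARGET = (
    "http://www.siliconprairienews.com/articles/4991"
    "?utm_source=direct-spne.ws&awesm=spne.ws_af"
    "&utm_medium=spne.ws-other&utm_content=api&utm_campaign="
)

# Redirect target for a garbage code, as pasted in the channel.
INVALID_TARGET = (
    "http://www.siliconprairienews.com/"
    "?awesm=spne.ws_1&utm_medium=spne.ws-root"
    "&utm_source=direct-spne.ws&utm_content=root"
)


def looks_invalid(target: str) -> bool:
    """True if the target is the homepage redirect used for unknown codes."""
    return HOMEPAGE_REDIRECT.search(target) is not None


if __name__ == "__main__":
    # The first pattern matches both targets, so it cannot tell them apart.
    print(bool(FIRST_TRY.search(VALID_TARGET)), bool(FIRST_TRY.search(INVALID_TARGET)))
    # The second only matches when "?" directly follows the domain's "/".
    print(looks_invalid(VALID_TARGET), looks_invalid(INVALID_TARGET))
```

The second pattern works because a link to an article always has a path between the domain and the query string, while the fallback redirect goes to the bare homepage.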
06:03 🔗 JesseW is it uppercase and lowercase?
06:03 🔗 phuzion Yep. A-Z a-z 0-9
06:04 🔗 JesseW cool
06:04 🔗 JesseW started
06:05 🔗 JesseW 6 of the first 10 jobs went to Atluxity :-)
06:09 🔗 JesseW sadly, it looks like a lot of the early URLs 404 on their site now.
06:22 🔗 phuzion JesseW: I'm gonna just kick off an archivebot archive of that whole TLD, sound good?
06:22 🔗 JesseW what, dot ws?
06:22 🔗 phuzion er
06:22 🔗 phuzion not the tld
06:22 🔗 phuzion but the domain
06:22 🔗 phuzion http://www.siliconprairienews.com/
06:22 🔗 JesseW How does one even *do* an archivebot of a whole TLD?
06:23 🔗 JesseW Ahh, that makes more sense.
06:23 🔗 phuzion lol, the same way we scrape an entire url shortener
06:23 🔗 phuzion I bet a lot of people would be like "how the fuck do you even back up all of bit.ly or tinyurl?"
06:24 🔗 JesseW archivebot isn't set up to do bruteforce searches, AFAIK.
06:25 🔗 JesseW and most domain names are longer than 4 characters. :-)
06:25 🔗 JesseW but sure, if someone wanted to code up an appropriate bot, I suppose it would be equally possible to identify all the 2nd-level domains that way.
06:25 🔗 phuzion Yeah, I know, I was just messing
06:26 🔗 JesseW IIRC, someone wanted to do that for .no -- because the registrar was claiming the list was secret.
06:27 🔗 phuzion Think my ISP would be pissed if I bruteforced the .no domain?
06:27 🔗 JesseW I don't know your ISP. Mine probably wouldn't mind. :-)
06:28 🔗 phuzion Pssh, my nameservers are 8.8.8.8, 8.8.4.4 and 4.2.2.2. Google and Level3 probably don't care.
06:28 🔗 JesseW Probably not. :-)
06:28 🔗 JesseW Feel free to look through the logs and contact whoever it was who wanted that.
06:30 🔗 * phuzion wonders what the fastest method to do this would be...
06:32 🔗 dashcloud has quit IRC (Read error: Operation timed out)
06:35 🔗 dashcloud has joined #urlteam
06:35 🔗 svchfoo3 sets mode: +o dashcloud
06:41 🔗 JesseW http://archiveteam.org/index.php?title=URLTeam#Common_numbers <- made the table I mentioned; feel free to add other ones
06:42 🔗 JesseW I didn't think 2 character was particularly useful, as it's so small we'd mostly just start from/end at 0 or 3-characters.
06:51 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
06:58 🔗 Coderjoe has joined #urlteam
07:46 🔗 phuzion Got a response from the qr.cx operator: He just gave us a dump, licensed under CC-BY
07:47 🔗 phuzion https://gist.github.com/phuzion/a29943a7979a2e5f3aa2
07:51 🔗 JesseW awesome!
07:52 🔗 phuzion So, the file is a CSV, here's a sample line: "http://qr.cx/gCD" "http://flo.cx" "2009-06-16 00:00:00"
07:52 🔗 phuzion That should be pretty damn easy to import.
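
A minimal sketch of reading a line in that shape with Python's standard `csv` module. It assumes the fields are double-quoted and separated by a single space, exactly as the sample appears in the log; the real file may well use tabs or commas, in which case only the `delimiter` argument would change. The `parse_line` helper and the field names are made up for this note.

```python
# Illustrative parser for the sample line quoted in the channel.
# Assumption: space-delimited, double-quoted fields (the real dump may differ).
import csv
import io
from datetime import datetime

SAMPLE = '"http://qr.cx/gCD" "http://flo.cx" "2009-06-16 00:00:00"'


def parse_line(line: str, delimiter: str = " ") -> dict:
    """Split one record into short URL, target URL, and creation time."""
    reader = csv.reader(io.StringIO(line), delimiter=delimiter, quotechar='"')
    short_url, target_url, created = next(reader)
    return {
        "short_url": short_url,
        "target_url": target_url,
        "created": datetime.strptime(created, "%Y-%m-%d %H:%M:%S"),
    }


if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Because the timestamp is quoted, the space inside it is kept as part of the third field rather than treated as a separator.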
07:53 🔗 JesseW I'm not sure whether it's best to dump http://qr.cx/dataset/qrcx_all_06eec9b9-1f29-4860-bd91-49c2d517d87d.7z in archivebot, or upload it directly to IA and just tag it with urlteam, or both, or something else...
07:53 🔗 JesseW please update the warrior job entry in any case
07:55 🔗 phuzion Uhhhg, wikitables. If someone doesn't mess with it by tomorrow morning, I'll wrap my head around the wikitable and do it
07:57 🔗 JesseW Just type what you want added to the comment, and I'll paste it in
07:58 🔗 JesseW (I'd be open to switching to a different table format, too, if you have a suggestion)
07:59 🔗 phuzion No, wikitables are fine, they're native to mediawiki, and they work, it just takes a bit of mental processing for me to wrap my head around them when I need to work with them.
07:59 🔗 phuzion And it's like 3am in my timezone
07:59 🔗 phuzion basically, I'd just move the wikitable thing to the "alive" section and say "We're not going to archive this for a bit because we have a full dump." or something
08:02 🔗 JesseW ok. and -- go to sleep, it's 3am in your timezone. :-)
08:02 🔗 JesseW (or not, as you wish)
08:40 🔗 JesseW migre-me seems to be having network problems that are causing it to take more than a minute to get responses back, preventing the scraper from working. Please remind me to try turning the scraper back on in a few days.
08:45 🔗 JesseW spne-ws done for now, thanks phuzion
09:07 🔗 JesseW has quit IRC (Leaving.)
09:17 🔗 dashcloud has quit IRC (Read error: Operation timed out)
09:31 🔗 dashcloud has joined #urlteam
09:31 🔗 svchfoo1 sets mode: +o dashcloud
09:36 🔗 dashcloud has quit IRC (Read error: Operation timed out)
09:42 🔗 dashcloud has joined #urlteam
09:42 🔗 svchfoo3 sets mode: +o dashcloud
13:32 🔗 WinterFox has quit IRC (Remote host closed the connection)
15:39 🔗 VADemon has joined #urlteam
15:42 🔗 asdf has quit IRC (Ping timeout: 252 seconds)
16:40 🔗 dashcloud has quit IRC (Ping timeout: 250 seconds)
16:49 🔗 dashcloud has joined #urlteam
16:50 🔗 svchfoo1 sets mode: +o dashcloud
17:15 🔗 JesseW has joined #urlteam
17:16 🔗 svchfoo1 sets mode: +o JesseW
17:36 🔗 phuzion at.cmt.com is an ow.ly domain it looks like.
17:38 🔗 JesseW has quit IRC (Leaving.)
17:54 🔗 VADemon has quit IRC (Read error: Operation timed out)
19:24 🔗 phuzion poeurl.com should be a quick scrape. 3 characters long
19:45 🔗 JW_work cool, will do when I'm home
20:06 🔗 phuzion JW_work: Out of curiosity, how much of a hassle would it be for me to get an account on the tracker so I can do little ones like that from time to time?
20:21 🔗 JW_work No trouble at all. In fact, I'm delighted to get more hands involved. I'll make you one asap; /msg me your desired username (presumably phuzion) and initial password.
20:22 🔗 phuzion Is there a method to change passwords?
20:23 🔗 phuzion PM'd a username/password to you
20:24 🔗 JW_work1 has joined #urlteam
20:26 🔗 JW_work has quit IRC (Ping timeout: 255 seconds)
20:33 🔗 JW_work1 has quit IRC (Ping timeout: 260 seconds)
20:42 🔗 JW_work has joined #urlteam
22:17 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
22:17 🔗 dashcloud has joined #urlteam
22:18 🔗 svchfoo1 sets mode: +o dashcloud
22:26 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:30 🔗 dashcloud has joined #urlteam
22:30 🔗 svchfoo3 sets mode: +o dashcloud
23:36 🔗 WinterFox has joined #urlteam
