Time |
Nickname |
Message |
00:14
🔗
|
|
asdf has joined #urlteam |
00:15
🔗
|
|
asdf has quit IRC (Remote host closed the connection) |
00:40
🔗
|
|
JesseW has joined #urlteam |
00:41
🔗
|
|
svchfoo1 sets mode: +o JesseW |
00:49
🔗
|
JesseW |
phuzion: letter looks good. At the end, after the incomplete sentence, I'd add "providing us a separate dump." |
00:50
🔗
|
JesseW |
regarding a description of ArchiveTeam, we could use the sentence on the front page: "a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage." |
01:44
🔗
|
|
asdf has joined #urlteam |
02:50
🔗
|
|
VADemon has quit IRC (left4dead) |
03:17
🔗
|
JesseW |
migre-me seems to be having some trouble right now -- paused grab |
04:13
🔗
|
|
logchfoo starts logging #urlteam at Mon Dec 21 04:13:30 2015 |
04:13
🔗
|
|
logchfoo has joined #urlteam |
04:54
🔗
|
|
dashcloud has quit IRC (Read error: Connection reset by peer) |
04:55
🔗
|
|
dashcloud has joined #urlteam |
04:55
🔗
|
|
svchfoo1 sets mode: +o dashcloud |
05:12
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
05:19
🔗
|
|
dashcloud has joined #urlteam |
05:19
🔗
|
|
svchfoo1 sets mode: +o dashcloud |
05:37
🔗
|
phuzion |
JesseW: around? |
05:38
🔗
|
JesseW |
yep |
05:38
🔗
|
* |
JesseW arises from the depths |
05:38
🔗
|
phuzion |
Can you take one final look at that email? |
05:41
🔗
|
JesseW |
change "If you could contact Archive.org and request that your dumps be made publicly available" to "If you contact Archive.org and they fix it so your dumps are publicly available" |
05:42
🔗
|
JesseW |
It's not good enough for them to *ask* -- the dumps need to actually *be* downloadable (and, for that matter downloaded) before we won't need to scrape. |
05:42
🔗
|
phuzion |
True. |
05:42
🔗
|
phuzion |
That change is made. Good to send? |
05:43
🔗
|
JesseW |
"again. #urlteam" -> "again. We're available in #urlteam" |
05:43
🔗
|
JesseW |
Otherwise it looks great! Thanks for writing it up. |
05:45
🔗
|
phuzion |
Sent. |
05:45
🔗
|
JesseW |
:-) |
05:47
🔗
|
phuzion |
I'm gonna see if there's anything else I can do on that list of still-alive shorteners on the wiki page |
05:47
🔗
|
JesseW |
yayayay -- thank you! |
05:47
🔗
|
phuzion |
No problem. |
05:49
🔗
|
JesseW |
If you want to write another letter -- there are about half a dozen 301works archives that should be made public, because the shortener has died -- I've been meaning to send a note in to info@archive, but haven't gotten around to it. Check out the Discontinued list. |
05:49
🔗
|
phuzion |
Eh, I really dislike writing letters, lol, I only wrote that one because they reached out to us directly. |
05:49
🔗
|
JesseW |
makes sense |
05:50
🔗
|
JesseW |
I'll get around to it eventually. I sort of don't want to bother Jeff K (the person who answers all the email) right after the telethon, which is one reason I'm waiting on it. |
05:50
🔗
|
phuzion |
totally understood. |
05:51
🔗
|
phuzion |
Total shorturl space is calculated by (number in alphabet)^(length of shorturl) right? |
05:51
🔗
|
phuzion |
number of characters in alphabet* |
05:51
🔗
|
JesseW |
assuming ^ is "to the power of", not bitwise XOR, yep |
05:52
🔗
|
phuzion |
ok, wanted to make sure I was doing my math right |
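A minimal sketch of the keyspace arithmetic discussed above, written in Python, where exponentiation is `**` and `^` is bitwise XOR. The 62-character alphabet (A-Z, a-z, 0-9) and the function names are illustrative assumptions, not part of the tracker's code.

```python
import string

# Assumed alphabet for illustration: A-Z, a-z, 0-9 (62 characters).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of possible shortcodes of exactly `length` characters."""
    return alphabet_size ** length


def keyspace_up_to(alphabet_size: int, max_length: int) -> int:
    """Number of possible shortcodes of 1..max_length characters."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


if __name__ == "__main__":
    size = len(ALPHABET)
    print(keyspace(size, 3))        # codes of exactly 3 characters
    print(keyspace_up_to(size, 3))  # codes of 1, 2 or 3 characters
    print(size ^ 3)                 # bitwise XOR, not a power
```

Summing over lengths matters when a shortener issues codes "up to" a given length, since the shorter codes are then part of the space to scan as well.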
05:52
🔗
|
JesseW |
hm -- I'm going to add a table of common values for that to the page, now |
05:57
🔗
|
phuzion |
JesseW: Another shortener researched and ready to drop into the tracker: spne.ws. A-Z a-z 0-9, should only take a day or so, it's 3 characters or less from what I've seen. |
05:58
🔗
|
JesseW |
cool, adding it now! |
05:58
🔗
|
phuzion |
Only problem is that valid and invalid responses are both 301 responses. |
05:58
🔗
|
JesseW |
we can handle that now, assuming there's a regex for the invalid case |
05:58
🔗
|
phuzion |
Should be. |
05:59
🔗
|
JesseW |
looks like this regex would work: awesm=spne\.ws |
05:59
🔗
|
phuzion |
Cool |
05:59
🔗
|
JesseW |
nope |
06:00
🔗
|
JesseW |
af gives http://www.siliconprairienews.com/articles/4991?utm_source=direct-spne.ws&awesm=spne.ws_af&utm_medium=spne.ws-other&utm_content=api&utm_campaign= |
06:00
🔗
|
phuzion |
Oh. Lame. |
06:00
🔗
|
JesseW |
yeah, still investigating |
06:01
🔗
|
phuzion |
The full URL I get when throwing garbage at it is: http://www.siliconprairienews.com/?awesm=spne.ws_1&utm_medium=spne.ws-root&utm_source=direct-spne.ws&utm_content=root |
06:01
🔗
|
JesseW |
hm, I think I'll try www\.siliconprairienews\.com/\? |
06:02
🔗
|
JesseW |
maybe with awesm= appended |
06:02
🔗
|
phuzion |
Nah, it's a vanity URL shortener, they're going to use their own links on there a lot. |
06:02
🔗
|
JesseW |
right, but all their links don't have a question mark right after the domain name |
06:02
🔗
|
JesseW |
(all their links that aren't to their homepage) |
06:02
🔗
|
phuzion |
Ohhhhhhhhhh |
06:02
🔗
|
* |
phuzion headdesk |
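A small sketch of the two candidate patterns from the exchange above, checked against the two redirect targets quoted in the log. The variable and function names are hypothetical, and this is only an illustration of the matching logic, not the tracker's actual configuration.

```python
import re

# Pattern proposed in the log for spotting the "invalid shortcode" redirect:
# the homepage URL with a query string directly after the domain.
INVALID_RE = re.compile(r"www\.siliconprairienews\.com/\?")

# Earlier candidate from the log, rejected because valid redirects contain it too.
REJECTED_RE = re.compile(r"awesm=spne\.ws")

# Redirect targets quoted in the log.
VALID_TARGET = (
    "http://www.siliconprairienews.com/articles/4991"
    "?utm_source=direct-spne.ws&awesm=spne.ws_af"
    "&utm_medium=spne.ws-other&utm_content=api&utm_campaign="
)
INVALID_TARGET = (
    "http://www.siliconprairienews.com/"
    "?awesm=spne.ws_1&utm_medium=spne.ws-root"
    "&utm_source=direct-spne.ws&utm_content=root"
)


def is_invalid_redirect(location: str) -> bool:
    """True if the redirect target looks like the 'unknown shortcode' page."""
    return INVALID_RE.search(location) is not None


if __name__ == "__main__":
    for url in (VALID_TARGET, INVALID_TARGET):
        print(is_invalid_redirect(url), bool(REJECTED_RE.search(url)))
```

The first pattern tells the two cases apart because article links have a path between the domain and the `?`, while the fallback redirect goes to the bare homepage.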
06:03
🔗
|
JesseW |
is it uppercase and lowercase? |
06:03
🔗
|
phuzion |
Yep. A-Z a-z 0-9 |
06:04
🔗
|
JesseW |
cool |
06:04
🔗
|
JesseW |
started |
06:05
🔗
|
JesseW |
6 of the first 10 jobs went to Atluxity :-) |
06:09
🔗
|
JesseW |
sadly, it looks like a lot of the early URLs 404 on their site now. |
06:22
🔗
|
phuzion |
JesseW: I'm gonna just kick off an archivebot archive of that whole TLD, sound good? |
06:22
🔗
|
JesseW |
what, dot ws? |
06:22
🔗
|
phuzion |
er |
06:22
🔗
|
phuzion |
not the tld |
06:22
🔗
|
phuzion |
but the domain |
06:22
🔗
|
phuzion |
http://www.siliconprairienews.com/ |
06:22
🔗
|
JesseW |
How does one even *do* an archivebot of a whole TLD? |
06:23
🔗
|
JesseW |
Ahh, that makes more sense. |
06:23
🔗
|
phuzion |
lol, the same way we scrape an entire url shortener |
06:23
🔗
|
phuzion |
I bet a lot of people would be like "how the fuck do you even back up all of bit.ly or tinyurl?" |
06:24
🔗
|
JesseW |
archivebot isn't set up to do bruteforce searches, AFAIK. |
06:25
🔗
|
JesseW |
and most domain names are longer than 4 characters. :-) |
06:25
🔗
|
JesseW |
but sure, if someone wanted to code up an appropriate bot, I suppose it would be equally possible to identify all the 2nd-level domains that way. |
06:25
🔗
|
phuzion |
Yeah, I know, I was just messing |
06:26
🔗
|
JesseW |
IIRC, someone wanted to do that for .no -- because the registrar was claiming the list was secret. |
06:27
🔗
|
phuzion |
Think my ISP would be pissed if I bruteforced the .no domain? |
06:27
🔗
|
JesseW |
I don't know your ISP. Mine probably wouldn't mind. :-) |
06:28
🔗
|
phuzion |
Pssh, my nameservers are 8.8.8.8, 8.8.4.4 and 4.2.2.2. Google and Level3 probably don't care. |
06:28
🔗
|
JesseW |
Probably not. :-) |
06:28
🔗
|
JesseW |
Feel free to look through the logs and contact whoever it was who wanted that. |
06:30
🔗
|
* |
phuzion wonders what the fastest method to do this would be... |
06:32
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
06:35
🔗
|
|
dashcloud has joined #urlteam |
06:35
🔗
|
|
svchfoo3 sets mode: +o dashcloud |
06:41
🔗
|
JesseW |
http://archiveteam.org/index.php?title=URLTeam#Common_numbers <- made the table I mentioned; feel free to add other ones |
06:42
🔗
|
JesseW |
I didn't think 2 character was particularly useful, as it's so small we'd mostly just start from/end at 0 or 3-characters. |
06:51
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
06:58
🔗
|
|
Coderjoe has joined #urlteam |
07:46
🔗
|
phuzion |
Got a response from the qr.cx operator: He just gave us a dump, licensed under CC-BY |
07:47
🔗
|
phuzion |
https://gist.github.com/phuzion/a29943a7979a2e5f3aa2 |
07:51
🔗
|
JesseW |
awesome! |
07:52
🔗
|
phuzion |
So, the file is a CSV, here's a sample line: "http://qr.cx/gCD" "http://flo.cx" "2009-06-16 00:00:00" |
07:52
🔗
|
phuzion |
That should be pretty damn easy to import. |
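A rough sketch of reading a line in the shape of the sample above. It assumes the fields are double-quoted and separated by a single space, as the sample appears in the log; the real dump may use tabs or another delimiter, and the function name is hypothetical.

```python
import csv
import io
from datetime import datetime

# Sample line as quoted in the log (delimiter assumed to be one space).
SAMPLE = '"http://qr.cx/gCD" "http://flo.cx" "2009-06-16 00:00:00"\n'


def parse_dump(text: str):
    """Return (short_url, target_url, created) tuples from dump text."""
    reader = csv.reader(io.StringIO(text), delimiter=" ", quotechar='"')
    rows = []
    for record in reader:
        if len(record) != 3:
            # Skip blank or malformed lines instead of guessing at them.
            continue
        short_url, target, created = record
        rows.append(
            (short_url, target, datetime.strptime(created, "%Y-%m-%d %H:%M:%S"))
        )
    return rows


if __name__ == "__main__":
    for row in parse_dump(SAMPLE):
        print(row)
```

Letting the `csv` module handle the quoting keeps the space inside the timestamp field from being treated as a separator.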
07:53
🔗
|
JesseW |
I'm not sure whether it's best to dump http://qr.cx/dataset/qrcx_all_06eec9b9-1f29-4860-bd91-49c2d517d87d.7z in archivebot, or upload it directly to IA and just tag it with urlteam, or both, or something else... |
07:53
🔗
|
JesseW |
please update the warrior job entry in any case |
07:55
🔗
|
phuzion |
Uhhhg, wikitables. If someone doesn't mess with it by tomorrow morning, I'll wrap my head around the wikitable and do it |
07:57
🔗
|
JesseW |
Just type what you want added to the comment, and I'll paste it in |
07:58
🔗
|
JesseW |
(I'd be open to switching to a different table format, too, if you have a suggestion) |
07:59
🔗
|
phuzion |
No, wikitables are fine, they're native to mediawiki, and they work, it just takes a bit of mental processing for me to wrap my head around them when I need to work with them. |
07:59
🔗
|
phuzion |
And it's like 3am in my timezone |
07:59
🔗
|
phuzion |
basically, I'd just move the wikitable thing to the "alive" section and say "We're not going to archive this for a bit because we have a full dump." or something |
08:02
🔗
|
JesseW |
ok. and -- go to sleep, it's 3am in your timezone. :-) |
08:02
🔗
|
JesseW |
(or not, as you wish) |
08:40
🔗
|
JesseW |
migre-me seems to be having network problems that are causing it to take more than a minute to get responses back, preventing the scraper from working. Please remind me to try turning the scraper back on in a few days. |
08:45
🔗
|
JesseW |
spne-ws done for now, thanks phuzion |
09:07
🔗
|
|
JesseW has quit IRC (Leaving.) |
09:17
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
09:31
🔗
|
|
dashcloud has joined #urlteam |
09:31
🔗
|
|
svchfoo1 sets mode: +o dashcloud |
09:36
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
09:42
🔗
|
|
dashcloud has joined #urlteam |
09:42
🔗
|
|
svchfoo3 sets mode: +o dashcloud |
13:32
🔗
|
|
WinterFox has quit IRC (Remote host closed the connection) |
15:39
🔗
|
|
VADemon has joined #urlteam |
15:42
🔗
|
|
asdf has quit IRC (Ping timeout: 252 seconds) |
16:40
🔗
|
|
dashcloud has quit IRC (Ping timeout: 250 seconds) |
16:49
🔗
|
|
dashcloud has joined #urlteam |
16:50
🔗
|
|
svchfoo1 sets mode: +o dashcloud |
17:15
🔗
|
|
JesseW has joined #urlteam |
17:16
🔗
|
|
svchfoo1 sets mode: +o JesseW |
17:36
🔗
|
phuzion |
at.cmt.com is an ow.ly domain it looks like. |
17:38
🔗
|
|
JesseW has quit IRC (Leaving.) |
17:54
🔗
|
|
VADemon has quit IRC (Read error: Operation timed out) |
19:24
🔗
|
phuzion |
poeurl.com should be a quick scrape. 3 characters long |
19:45
🔗
|
JW_work |
cool, will do when I'm home |
20:06
🔗
|
phuzion |
JW_work: Out of curiosity, how much of a hassle would it be for me to get an account on the tracker so I can do little ones like that from time to time? |
20:21
🔗
|
JW_work |
No trouble at all. In fact, I'm delighted to get more hands involved. I'll make you one asap; /msg me your desired username (presumably phuzion) and initial password. |
20:22
🔗
|
phuzion |
Is there a method to change passwords? |
20:23
🔗
|
phuzion |
PM'd a username/password to you |
20:24
🔗
|
|
JW_work1 has joined #urlteam |
20:26
🔗
|
|
JW_work has quit IRC (Ping timeout: 255 seconds) |
20:33
🔗
|
|
JW_work1 has quit IRC (Ping timeout: 260 seconds) |
20:42
🔗
|
|
JW_work has joined #urlteam |
22:17
🔗
|
|
dashcloud has quit IRC (Read error: Connection reset by peer) |
22:17
🔗
|
|
dashcloud has joined #urlteam |
22:18
🔗
|
|
svchfoo1 sets mode: +o dashcloud |
22:26
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:30
🔗
|
|
dashcloud has joined #urlteam |
22:30
🔗
|
|
svchfoo3 sets mode: +o dashcloud |
23:36
🔗
|
|
WinterFox has joined #urlteam |