#urlteam 2015-11-14,Sat

Time Nickname Message
00:12 🔗 bwn has quit IRC (Read error: Operation timed out)
00:24 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
00:25 🔗 aaaaaaaaa has joined #urlteam
00:25 🔗 swebb sets mode: +o aaaaaaaaa
01:14 🔗 JesseW has joined #urlteam
01:15 🔗 svchfoo3 sets mode: +o JesseW
01:20 🔗 xmc has quit IRC (Read error: Operation timed out)
01:21 🔗 JesseW OK, here's the 20,141 unique URLs containing "adrive.com" in the URLteam corpus: https://gist.github.com/JesseWeinstein/f1287df2ca12b1d96705/raw/3fb99fcea51fa990ded6e67c642e7a7c69a08aa3/gistfile1.txt
01:21 🔗 svchfoo1 has quit IRC (Read error: Operation timed out)
01:23 🔗 xmc has joined #urlteam
01:23 🔗 swebb sets mode: +o xmc
01:23 🔗 Fusl has quit IRC (Ping timeout: 255 seconds)
01:23 🔗 JesseW I need to figure out a better way to handle migre.me -- for now, I've turned it off.
01:23 🔗 svchfoo3 has quit IRC (Ping timeout: 369 seconds)
01:25 🔗 svchfoo3 has joined #urlteam
01:25 🔗 chazchaz has quit IRC (Ping timeout: 369 seconds)
01:27 🔗 phuzion has quit IRC (Ping timeout: 369 seconds)
01:27 🔗 atlogbot has quit IRC (Ping timeout: 369 seconds)
01:27 🔗 phuzion has joined #urlteam
01:27 🔗 JesseW has quit IRC (Leaving.)
01:28 🔗 aaaaaaaaa sets mode: +o svchfoo3
01:29 🔗 Fusl has joined #urlteam
01:30 🔗 atlogbot has joined #urlteam
01:31 🔗 chazchaz has joined #urlteam
01:31 🔗 svchfoo3 sets mode: +o chazchaz
01:35 🔗 svchfoo1 has joined #urlteam
01:35 🔗 svchfoo3 sets mode: +o svchfoo1
01:50 🔗 bwn has joined #urlteam
02:40 🔗 aaaaaaaaa has quit IRC (Read error: Operation timed out)
03:55 🔗 bwn has quit IRC (Read error: Operation timed out)
04:15 🔗 bwn has joined #urlteam
05:00 🔗 JesseW has joined #urlteam
05:00 🔗 svchfoo1 sets mode: +o JesseW
05:54 🔗 dashcloud has quit IRC (Read error: Operation timed out)
05:58 🔗 dashcloud has joined #urlteam
05:59 🔗 svchfoo3 sets mode: +o dashcloud
06:01 🔗 bwn has quit IRC (Read error: Operation timed out)
06:11 🔗 GLaDOS has quit IRC (Read error: Operation timed out)
06:22 🔗 bwn has joined #urlteam
06:41 🔗 JesseW has quit IRC (Leaving.)
06:43 🔗 GLaDOS has joined #urlteam
06:43 🔗 svchfoo3 sets mode: +o GLaDOS
08:40 🔗 WinterFox has quit IRC (Remote host closed the connection)
08:42 🔗 WinterFox has joined #urlteam
08:57 🔗 dashcloud has quit IRC (Read error: Operation timed out)
09:00 🔗 dashcloud has joined #urlteam
09:00 🔗 svchfoo3 sets mode: +o dashcloud
10:27 🔗 VADemon has quit IRC (left4dead)
11:27 🔗 dashcloud has quit IRC (Read error: Operation timed out)
11:31 🔗 dashcloud has joined #urlteam
11:31 🔗 svchfoo3 sets mode: +o dashcloud
11:42 🔗 VADemon has joined #urlteam
11:49 🔗 bwn has quit IRC (Read error: Operation timed out)
12:13 🔗 dashcloud has quit IRC (Read error: Operation timed out)
12:16 🔗 dashcloud has joined #urlteam
12:17 🔗 svchfoo3 sets mode: +o dashcloud
12:42 🔗 bwn has joined #urlteam
13:14 🔗 chazchaz has quit IRC (Read error: Operation timed out)
13:20 🔗 chazchaz has joined #urlteam
13:20 🔗 svchfoo1 sets mode: +o chazchaz
13:39 🔗 slang has quit IRC (Ping timeout: 240 seconds)
13:55 🔗 WinterFox has quit IRC (Remote host closed the connection)
15:56 🔗 JW_work has quit IRC (Read error: Connection reset by peer)
15:58 🔗 JW_work has joined #urlteam
15:58 🔗 JW_work has quit IRC (Read error: Connection reset by peer)
16:02 🔗 JW_work has joined #urlteam
17:51 🔗 JesseW has joined #urlteam
17:51 🔗 svchfoo3 sets mode: +o JesseW
18:16 🔗 JesseW Once we finish the first round of migre.me, I'm going to have to go through the results, and do a second round of 1 URL items for the ones we missed in this round...
18:17 🔗 JesseW but that's only ~5,000 so far, so it shouldn't be too painful.
18:38 🔗 bwn_ has joined #urlteam
18:39 🔗 JesseW has quit IRC (Leaving.)
18:46 🔗 bwn has quit IRC (Read error: Operation timed out)
18:57 🔗 aaaaaaaaa has joined #urlteam
18:57 🔗 swebb sets mode: +o aaaaaaaaa
19:32 🔗 JesseW has joined #urlteam
19:32 🔗 svchfoo3 sets mode: +o JesseW
20:44 🔗 VADemon has quit IRC (left4dead)
20:52 🔗 WinterFox has joined #urlteam
21:46 🔗 bwn_ has quit IRC (Ping timeout: 606 seconds)
21:52 🔗 WinterFox How are the urls arranged in the dumps? JSON, csv?
22:03 🔗 * JesseW goes to write this up on http://archiveteam.org/index.php?title=URLTeam
22:06 🔗 WinterFox I almost have yesterday's dump downloaded so I will check on that
22:13 🔗 bwn has joined #urlteam
22:20 🔗 JesseW WinterFox: OK, written up (thanks for prompting me to do it): http://archiveteam.org/index.php?title=URLTeam#Archives
22:20 🔗 JesseW let me know if you have questions
22:22 🔗 WinterFox Looks good.
22:23 🔗 WinterFox JesseW, Do you have a script to download all the daily dumps?
22:24 🔗 JesseW I hacked together stuff, but I don't have a script, exactly.
22:26 🔗 JesseW print '\n'.join(list(x.get_files(glob_pattern='*.torrent'))[0].url for x in list(internetarchive.api.search_items('subject:urlteam').iter_as_items()))
22:27 🔗 JesseW should get you a list of URLs of the torrents for all the daily dumps, which you can then push into transmission (or another bt client) with xargs
22:27 🔗 WinterFox That's Python, right?
22:27 🔗 JesseW yep. Python using the internetarchive library
22:28 🔗 JesseW It's basically just making a call to the IA's Advanced Search for things with a subject of urlteam, then grabbing the files with a torrent extension, and printing out the download URLs for them.
22:29 🔗 JesseW One could pretty easily rewrite it in shell, with curl, etc.
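[Editor's note] The curl-style approach JesseW mentions can also be sketched with only the Python standard library. This is a rough illustration, not what anyone in the channel ran. It assumes the Internet Archive's advancedsearch.php JSON endpoint and the usual `<identifier>_archive.torrent` download path; the function names (`build_search_url`, `torrent_url`, `extract_identifiers`, `list_torrent_urls`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Advanced Search endpoint of the Internet Archive.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(query="subject:urlteam", rows=1000, page=1):
    """Build a search URL that asks only for item identifiers, as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("rows", str(rows)),
        ("page", str(page)),
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


def torrent_url(identifier):
    """Guess the torrent URL for an item.

    Assumption: each item has an auto-generated torrent at this path.
    """
    return "https://archive.org/download/{0}/{0}_archive.torrent".format(identifier)


def extract_identifiers(payload):
    """Pull identifiers out of a decoded search response."""
    return [doc["identifier"] for doc in payload["response"]["docs"]]


def list_torrent_urls(query="subject:urlteam", rows=1000):
    """Run the search over the network and return one torrent URL per item."""
    with urllib.request.urlopen(build_search_url(query, rows)) as resp:
        payload = json.load(resp)
    return [torrent_url(i) for i in extract_identifiers(payload)]
```

Printing `"\n".join(list_torrent_urls())` would give a list that can be piped to a BitTorrent client with xargs, as described above. If there are more items than `rows`, the `page` parameter would need to be stepped through.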
22:30 🔗 WinterFox AttributeError: 'Search' object has no attribute 'iter_as_items'
22:32 🔗 JesseW ah, sorry, you'll need to use the 1.0 branch
22:32 🔗 JesseW we're planning on merging them soon, but haven't done it yet
22:32 🔗 JesseW want more tests first
22:33 🔗 WinterFox So I need a newer version of the internetarchive lib?
22:33 🔗 WinterFox Can I do that with pip?
22:34 🔗 JesseW or, as iter_as_items is just a convenience, you could convert the search results to items yourself, by pulling out the identifier and stuffing it in internetarchive.api.get_item
22:35 🔗 JesseW e.g. print '\n'.join(list(iaapi.get_item(x['identifier']).get_files(glob_pattern='*.torrent'))[0].url for x in iaapi.search_items('subject:urlteam AND addeddate:[2015-11-05 TO 2016]'))
22:35 🔗 JesseW do import internetarchive.api as iaapi
22:35 🔗 JesseW first
22:35 🔗 JesseW or change the references to iaapi to use the longer name. :-)
22:37 🔗 WinterFox So I just change 2015-11-05 to 2013 to get them all?
22:37 🔗 JesseW also, the code above only gets the last few -- remove the AND addeddate part to get all of them
22:37 🔗 WinterFox ah
22:37 🔗 JesseW I was just using that to get a list of them to add to my collection
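[Editor's note] The one-liners above use the Python 2 print statement. A Python 3 sketch of the same idea follows, with the date filter made optional. It assumes an installed internetarchive library that exports `search_items` and `get_item` at package level; the helper names (`build_query`, `first_torrent_url`, `torrent_urls`) are hypothetical and not part of that library.

```python
def build_query(date_range=None):
    """Return the search query; date_range is an optional (start, end) pair."""
    query = "subject:urlteam"
    if date_range is not None:
        start, end = date_range
        query += " AND addeddate:[{} TO {}]".format(start, end)
    return query


def first_torrent_url(item):
    """Return the URL of the item's first *.torrent file, or None if it has none."""
    files = list(item.get_files(glob_pattern="*.torrent"))
    return files[0].url if files else None


def torrent_urls(date_range=None):
    """Search for urlteam items and collect one torrent URL per item."""
    # Third-party import kept local so the helpers above work without it.
    from internetarchive import get_item, search_items

    urls = []
    for result in search_items(build_query(date_range)):
        # Same workaround as in the chat: look each item up by identifier
        # instead of relying on iter_as_items.
        url = first_torrent_url(get_item(result["identifier"]))
        if url:
            urls.append(url)
    return urls
```

Calling `print("\n".join(torrent_urls()))` should list every dump, while `torrent_urls(("2015-11-05", "2016"))` mirrors the restricted query JesseW pasted.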
22:43 🔗 WinterFox It seems to be working
22:51 🔗 JesseW cool
23:20 🔗 JesseW has quit IRC (Leaving.)
