[00:12] *** bwn has quit IRC (Read error: Operation timed out)
[00:24] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[00:25] *** aaaaaaaaa has joined #urlteam
[00:25] *** swebb sets mode: +o aaaaaaaaa
[01:14] *** JesseW has joined #urlteam
[01:15] *** svchfoo3 sets mode: +o JesseW
[01:20] *** xmc has quit IRC (Read error: Operation timed out)
[01:21] OK, here are the 20,141 unique URLs containing "adrive.com" in the URLteam corpus: https://gist.github.com/JesseWeinstein/f1287df2ca12b1d96705/raw/3fb99fcea51fa990ded6e67c642e7a7c69a08aa3/gistfile1.txt
[01:21] *** svchfoo1 has quit IRC (Read error: Operation timed out)
[01:23] *** xmc has joined #urlteam
[01:23] *** swebb sets mode: +o xmc
[01:23] *** Fusl has quit IRC (Ping timeout: 255 seconds)
[01:23] I need to figure out a better way to handle migre.me -- for now, I've turned it off.
[01:23] *** svchfoo3 has quit IRC (Ping timeout: 369 seconds)
[01:25] *** svchfoo3 has joined #urlteam
[01:25] *** chazchaz has quit IRC (Ping timeout: 369 seconds)
[01:27] *** phuzion has quit IRC (Ping timeout: 369 seconds)
[01:27] *** atlogbot has quit IRC (Ping timeout: 369 seconds)
[01:27] *** phuzion has joined #urlteam
[01:27] *** JesseW has quit IRC (Leaving.)
[01:28] *** aaaaaaaaa sets mode: +o svchfoo3
[01:29] *** Fusl has joined #urlteam
[01:30] *** atlogbot has joined #urlteam
[01:31] *** chazchaz has joined #urlteam
[01:31] *** svchfoo3 sets mode: +o chazchaz
[01:35] *** svchfoo1 has joined #urlteam
[01:35] *** svchfoo3 sets mode: +o svchfoo1
[01:50] *** bwn has joined #urlteam
[02:40] *** aaaaaaaaa has quit IRC (Read error: Operation timed out)
[03:55] *** bwn has quit IRC (Read error: Operation timed out)
[04:15] *** bwn has joined #urlteam
[05:00] *** JesseW has joined #urlteam
[05:00] *** svchfoo1 sets mode: +o JesseW
[05:54] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:58] *** dashcloud has joined #urlteam
[05:59] *** svchfoo3 sets mode: +o dashcloud
[06:01] *** bwn has quit IRC (Read error: Operation timed out)
[06:11] *** GLaDOS has quit IRC (Read error: Operation timed out)
[06:22] *** bwn has joined #urlteam
[06:41] *** JesseW has quit IRC (Leaving.)
[06:43] *** GLaDOS has joined #urlteam
[06:43] *** svchfoo3 sets mode: +o GLaDOS
[08:40] *** WinterFox has quit IRC (Remote host closed the connection)
[08:42] *** WinterFox has joined #urlteam
[08:57] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:00] *** dashcloud has joined #urlteam
[09:00] *** svchfoo3 sets mode: +o dashcloud
[10:27] *** VADemon has quit IRC (left4dead)
[11:27] *** dashcloud has quit IRC (Read error: Operation timed out)
[11:31] *** dashcloud has joined #urlteam
[11:31] *** svchfoo3 sets mode: +o dashcloud
[11:42] *** VADemon has joined #urlteam
[11:49] *** bwn has quit IRC (Read error: Operation timed out)
[12:13] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:16] *** dashcloud has joined #urlteam
[12:17] *** svchfoo3 sets mode: +o dashcloud
[12:42] *** bwn has joined #urlteam
[13:14] *** chazchaz has quit IRC (Read error: Operation timed out)
[13:20] *** chazchaz has joined #urlteam
[13:20] *** svchfoo1 sets mode: +o chazchaz
[13:39] *** slang has quit IRC (Ping timeout: 240 seconds)
[13:55] *** WinterFox has quit IRC (Remote host closed the connection)
[15:56] *** JW_work has quit IRC (Read error: Connection reset by peer)
[15:58] *** JW_work has joined #urlteam
[15:58] *** JW_work has quit IRC (Read error: Connection reset by peer)
[16:02] *** JW_work has joined #urlteam
[17:51] *** JesseW has joined #urlteam
[17:51] *** svchfoo3 sets mode: +o JesseW
[18:16] Once we finish the first round of migre.me, I'm going to have to go through the results and do a second round of 1-URL items for the ones we missed in this round...
[18:17] but that's only ~5,000 so far, so it shouldn't be too painful.
[18:38] *** bwn_ has joined #urlteam
[18:39] *** JesseW has quit IRC (Leaving.)
[18:46] *** bwn has quit IRC (Read error: Operation timed out)
[18:57] *** aaaaaaaaa has joined #urlteam
[18:57] *** swebb sets mode: +o aaaaaaaaa
[19:32] *** JesseW has joined #urlteam
[19:32] *** svchfoo3 sets mode: +o JesseW
[20:44] *** VADemon has quit IRC (left4dead)
[20:52] *** WinterFox has joined #urlteam
[21:46] *** bwn_ has quit IRC (Ping timeout: 606 seconds)
[21:52] How are the URLs arranged in the dumps? JSON, CSV?
[22:03] * JesseW goes to write this up on http://archiveteam.org/index.php?title=URLTeam
[22:06] I almost have yesterday's dump downloaded, so I will check on that
[22:13] *** bwn has joined #urlteam
[22:20] WinterFox: OK, written up (thanks for prompting me to do it): http://archiveteam.org/index.php?title=URLTeam#Archives
[22:20] let me know if you have questions
[22:22] Looks good.
[22:23] JesseW, do you have a script to download all the daily dumps?
[22:24] I hacked together stuff, but I don't have a script, exactly.
[22:26] print '\n'.join(list(x.get_files(glob_pattern='*.torrent'))[0].url for x in list(internetarchive.api.search_items('subject:urlteam').iter_as_items()))
[22:27] should get you a list of URLs of the torrents for all the daily dumps, which you can then push into transmission (or another BitTorrent client) with xargs
[22:27] That's Python, right?
[22:27] yep. Python, using the internetarchive library
[22:28] It's basically just making a call to the IA's Advanced Search for things with a subject of urlteam, then grabbing the files with a torrent extension, and printing out the download URLs for them.
[22:29] One could pretty easily rewrite it in shell, with curl, etc.
[22:30] AttributeError: 'Search' object has no attribute 'iter_as_items'
[22:32] ah, sorry, you'll need to use the 1.0 branch
[22:32] we're planning on merging them soon, but haven't done it yet
[22:32] want more tests first
[22:33] So I need a newer version of the internetarchive lib?
[22:33] Can I do that with pip?
[22:34] or, as iter_as_items is just a convenience, you could convert the search results to items yourself, by pulling out the identifier and stuffing it into internetarchive.api.get_item
[22:35] e.g. print '\n'.join(list(iaapi.get_item(x['identifier']).get_files(glob_pattern='*.torrent'))[0].url for x in iaapi.search_items('subject:urlteam AND addeddate:[2015-11-05 TO 2016]'))
[22:35] do import internetarchive.api as iaapi
[22:35] first
[22:35] or change the references to iaapi to use the longer name. :-)
[22:37] So I just change 2015-11-05 to 2013 to get them all?
[22:37] also, the code above only gets the last few -- remove the AND addeddate part to get all of them
[22:37] ah
[22:37] I was just using that to get a list of them to add to my collection
[22:43] It seems to be working
[22:51] cool
[23:20] *** JesseW has quit IRC (Leaving.)
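
For reference, the approach sketched in the one-liners above (ask the IA's Advanced Search for everything with subject:urlteam, pull each result's identifier into get_item, then print the URL of each item's .torrent file) can be written out as a short script. This is a minimal sketch assuming Python 3 and a post-merge internetarchive release, in which search results are plain dicts rather than items; the script itself is not from the log:

    #!/usr/bin/env python3
    # Minimal sketch: list the .torrent download URLs for every URLTeam
    # daily dump item on the Internet Archive. Assumes a post-merge
    # internetarchive release (pip install internetarchive), where
    # search results are dicts that must be turned into items by hand.
    from internetarchive import get_item, search_items

    # Advanced Search for everything tagged subject:urlteam; the
    # addeddate clause used at 22:35 is dropped so all dumps match.
    for result in search_items('subject:urlteam'):
        item = get_item(result['identifier'])
        # Print the URL of each .torrent file in the dump item.
        for f in item.get_files(glob_pattern='*.torrent'):
            print(f.url)

The printed list can then be pushed into transmission with xargs, as suggested at 22:27 -- for example, something like "python3 list_torrents.py | xargs -n1 transmission-remote --add", where list_torrents.py is a hypothetical name for the script above.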