[00:12] *** bwn has quit IRC (Read error: Operation timed out)
[00:24] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[00:25] *** aaaaaaaaa has joined #urlteam
[00:25] *** swebb sets mode: +o aaaaaaaaa
[01:14] *** JesseW has joined #urlteam
[01:15] *** svchfoo3 sets mode: +o JesseW
[01:20] *** xmc has quit IRC (Read error: Operation timed out)
[01:21] OK, here are the 20,141 unique URLs containing "adrive.com" in the URLteam corpus: https://gist.github.com/JesseWeinstein/f1287df2ca12b1d96705/raw/3fb99fcea51fa990ded6e67c642e7a7c69a08aa3/gistfile1.txt
[01:21] *** svchfoo1 has quit IRC (Read error: Operation timed out)
[01:23] *** xmc has joined #urlteam
[01:23] *** swebb sets mode: +o xmc
[01:23] *** Fusl has quit IRC (Ping timeout: 255 seconds)
[01:23] I need to figure out a better way to handle migre.me -- for now, I've turned it off.
[01:23] *** svchfoo3 has quit IRC (Ping timeout: 369 seconds)
[01:25] *** svchfoo3 has joined #urlteam
[01:25] *** chazchaz has quit IRC (Ping timeout: 369 seconds)
[01:27] *** phuzion has quit IRC (Ping timeout: 369 seconds)
[01:27] *** atlogbot has quit IRC (Ping timeout: 369 seconds)
[01:27] *** phuzion has joined #urlteam
[01:27] *** JesseW has quit IRC (Leaving.)
[01:28] *** aaaaaaaaa sets mode: +o svchfoo3
[01:29] *** Fusl has joined #urlteam
[01:30] *** atlogbot has joined #urlteam
[01:31] *** chazchaz has joined #urlteam
[01:31] *** svchfoo3 sets mode: +o chazchaz
[01:35] *** svchfoo1 has joined #urlteam
[01:35] *** svchfoo3 sets mode: +o svchfoo1
[01:50] *** bwn has joined #urlteam
[02:40] *** aaaaaaaaa has quit IRC (Read error: Operation timed out)
[03:55] *** bwn has quit IRC (Read error: Operation timed out)
[04:15] *** bwn has joined #urlteam
[05:00] *** JesseW has joined #urlteam
[05:00] *** svchfoo1 sets mode: +o JesseW
[05:54] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:58] *** dashcloud has joined #urlteam
[05:59] *** svchfoo3 sets mode: +o dashcloud
[06:01] *** bwn has quit IRC (Read error: Operation timed out)
[06:11] *** GLaDOS has quit IRC (Read error: Operation timed out)
[06:22] *** bwn has joined #urlteam
[06:41] *** JesseW has quit IRC (Leaving.)
[06:43] *** GLaDOS has joined #urlteam
[06:43] *** svchfoo3 sets mode: +o GLaDOS
[08:40] *** WinterFox has quit IRC (Remote host closed the connection)
[08:42] *** WinterFox has joined #urlteam
[08:57] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:00] *** dashcloud has joined #urlteam
[09:00] *** svchfoo3 sets mode: +o dashcloud
[10:27] *** VADemon has quit IRC (left4dead)
[11:27] *** dashcloud has quit IRC (Read error: Operation timed out)
[11:31] *** dashcloud has joined #urlteam
[11:31] *** svchfoo3 sets mode: +o dashcloud
[11:42] *** VADemon has joined #urlteam
[11:49] *** bwn has quit IRC (Read error: Operation timed out)
[12:13] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:16] *** dashcloud has joined #urlteam
[12:17] *** svchfoo3 sets mode: +o dashcloud
[12:42] *** bwn has joined #urlteam
[13:14] *** chazchaz has quit IRC (Read error: Operation timed out)
[13:20] *** chazchaz has joined #urlteam
[13:20] *** svchfoo1 sets mode: +o chazchaz
[13:39] *** slang has quit IRC (Ping timeout: 240 seconds)
[13:55] *** WinterFox has quit IRC (Remote host closed the connection)
[15:56] *** JW_work has quit IRC (Read error: Connection reset by peer)
[15:58] *** JW_work has joined #urlteam
[15:58] *** JW_work has quit IRC (Read error: Connection reset by peer)
[16:02] *** JW_work has joined #urlteam
[17:51] *** JesseW has joined #urlteam
[17:51] *** svchfoo3 sets mode: +o JesseW
[18:16] Once we finish the first round of migre.me, I'm going to have to go through the results and do a second round of 1-URL items for the ones we missed in this round...
[18:17] but that's only ~5,000 so far, so it shouldn't be too painful.
[18:38] *** bwn_ has joined #urlteam
[18:39] *** JesseW has quit IRC (Leaving.)
[18:46] *** bwn has quit IRC (Read error: Operation timed out)
[18:57] *** aaaaaaaaa has joined #urlteam
[18:57] *** swebb sets mode: +o aaaaaaaaa
[19:32] *** JesseW has joined #urlteam
[19:32] *** svchfoo3 sets mode: +o JesseW
[20:44] *** VADemon has quit IRC (left4dead)
[20:52] *** WinterFox has joined #urlteam
[21:46] *** bwn_ has quit IRC (Ping timeout: 606 seconds)
[21:52] How are the URLs arranged in the dumps? JSON, CSV?
[22:03] * JesseW goes to write this up on http://archiveteam.org/index.php?title=URLTeam
[22:06] I almost have yesterday's dump downloaded, so I will check on that
[22:13] *** bwn has joined #urlteam
[22:20] WinterFox: OK, written up (thanks for prompting me to do it): http://archiveteam.org/index.php?title=URLTeam#Archives
[22:20] let me know if you have questions
[22:22] Looks good.
[22:23] JesseW, do you have a script to download all the daily dumps?
[22:24] I hacked together stuff, but I don't have a script, exactly.
[22:26] print '\n'.join(list(x.get_files(glob_pattern='*.torrent'))[0].url for x in list(internetarchive.api.search_items('subject:urlteam').iter_as_items()))
[22:27] should get you a list of URLs of the torrents for all the daily dumps, which you can then push into transmission (or another BitTorrent client) with xargs
[22:27] That's Python, right?
[22:27] yep. Python, using the internetarchive library
[22:28] It's basically just making a call to the IA's Advanced Search for things with a subject of urlteam, then grabbing the files with a torrent extension, and printing out the download URLs for them.
[22:29] One could pretty easily rewrite it in shell, with curl, etc.
[22:30] AttributeError: 'Search' object has no attribute 'iter_as_items'
[22:32] ah, sorry, you'll need to use the 1.0 branch
[22:32] we're planning on merging them soon, but haven't done it yet
[22:32] want more tests first
[22:33] So I need a newer version of the internetarchive lib?
[22:33] Can I do that with pip?
[22:34] or, as iter_as_items is just a convenience, you could convert the search results to items yourself, by pulling out the identifier and stuffing it into internetarchive.api.get_item
[22:35] e.g. print '\n'.join(list(iaapi.get_item(x['identifier']).get_files(glob_pattern='*.torrent'))[0].url for x in iaapi.search_items('subject:urlteam AND addeddate:[2015-11-05 TO 2016]'))
[22:35] do import internetarchive.api as iaapi
[22:35] first
[22:35] or change the references to iaapi to use the longer name. :-)
[22:37] So I just change 2015-11-05 to 2013 to get them all?
[22:37] also, the code above only gets the last few -- remove the AND addeddate part to get all of them
[22:37] ah
[22:37] I was just using that to get a list of them to add to my collection
[22:43] It seems to be working
[22:51] cool
[23:20] *** JesseW has quit IRC (Leaving.)
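
For reference, the approach sketched in the one-liners above (ask the IA's Advanced Search for everything with subject:urlteam, pull each result's identifier into get_item, then print the URL of each item's .torrent file) can be written out as a short script. This is a minimal sketch assuming Python 3 and a post-merge internetarchive release, in which search results are plain dicts rather than items; the script itself is not from the log:

    #!/usr/bin/env python3
    # Minimal sketch: list the .torrent download URLs for every URLTeam
    # daily dump item on the Internet Archive. Assumes a post-merge
    # internetarchive release (pip install internetarchive), where
    # search results are dicts that must be turned into items by hand.
    from internetarchive import get_item, search_items

    # Advanced Search for everything tagged subject:urlteam; the
    # addeddate clause used at 22:35 is dropped so all dumps match.
    for result in search_items('subject:urlteam'):
        item = get_item(result['identifier'])
        # Print the URL of each .torrent file in the dump item.
        for f in item.get_files(glob_pattern='*.torrent'):
            print(f.url)

The printed list can then be pushed into transmission with xargs, as suggested at 22:27 -- for example, something like "python3 list_torrents.py | xargs -n1 transmission-remote --add", where list_torrents.py is a hypothetical name for the script above.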