[08:07] https://archive.org/post/1002569/requested-download-is-not-authorized-for-use-with-this-tracker
[10:17] Ah. How stupid I am. SketchCow needs to get book_op to create the torrents on Wikimediacommons* (uppercase W) items too, just that. Ideally they should be renamed (it was Calc messing up the uppercase in the CSV... when Calligra is better).
[17:58] http://variety.com/2013/biz/news/isohunt-to-shut-down-as-part-of-settlement-with-studios-1200734509/
[18:11] time to scrape all their IDs?
[18:11] I think it's shutting down today though
[18:27] i'm grabbing the isohunt.com forums
[18:30] isohunt is shutting down?
[18:30] what the fuck
[18:30] it's been troubled for a long time
[18:30] i know it linked to tons of archive.org torrents
[18:33] i know, but it's sudden
[19:56] had an older bookmark open in my browser and refreshed it and saw this... not sure when Cue died... http://cueup.com/
[19:56] looks like early this month: http://techcrunch.com/2013/10/02/cue-greplin/
[19:58] ah, Apple bought it
[19:59] they took down cue up adventure :(
[21:21] http://torrentfreak.com/isohunt-shuts-down-after-110-million-settlement-with-the-mpaa-131017/
[21:23] Sites: 555 • Trackers: 235,842 • Active Torrents: 13,737,689 • Files: 285.58M • Size: 17,371.74 TB • Peers: 52.83M
[21:23] :(
[21:26] I'd also love it if http://publicbt.com/all.txt.bz2 worked again
[21:27] There are 13,000,000 torrents. If each .torrent file is 50 KB on average, then the total storage required to store all of them would be < 700 MB
[21:27] actually, disregard that
[21:27] I mean 700 GB
[21:33] well, it's not particularly useful to store torrents anyway
[21:34] publicbt.com gave you all you needed, but now it doesn't work
[21:43] well, it's not just the torrent files; they also have uploader comments (i.e. metadata)
[21:44] The metadata is interesting
[21:59] Nemo_bis, omf_, DFJustin, etc: http://pastie.org/private/cbryvcdrxpf7dod4vfla
[21:59] that will at least grab all the torrents
[21:59] or nearly all, anyway
[22:01] their JSON search API is really restrictive :(
[22:01] max 1000 results per query
[22:01] I mean, I *could* write another bruteforce searcher again...
[22:02] but their forum thread suggested that they monitor search request rate
[22:02] useful: the numeric IDs for their torrents are the same as for the detail pages
[22:02] * joepie91 feels like this would be a good Warrior project
[22:04] yeah, just iterate through https://isohunt.com/torrent_details/xxxxxx/
[22:04] DFJustin: I'm intentionally iterating through the .torrents actually
[22:04] instead of the details pages
[22:04] I feel like a static torrent-serving backend would be faster
[22:04] note that they have already settled in court; who knows when they're going to shut the site down
[22:04] thus you can get 404'd before it does any DB queries
[22:05] the .torrent files are mirrored everywhere already though; the unique stuff is all the "what the hell is this" text
[22:05] you'd be downloading the .torrents anyway
[22:05] so might as well start with those, and for the non-404s then fetch the /details/ pages
[22:05] that makes sense, I guess
[22:05] DFJustin: many isohunt torrents are no longer in their original location
[22:05] in my experience
[22:05] :P
[22:05] lysobit: that's why I'm optimizing for speed
[22:06] do we have any awake Warrior devs?
[22:06] make it multithreaded
[22:06] that are familiar with the pipeline stuff etc.
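
The pastie script linked above is long gone. Below is a minimal sketch of what an ID-iterating grabber along the lines described here might have looked like. Only the /torrent_details/<id>/ URL pattern is confirmed in the log; the .torrent download URL, the ID range, and the output directory are assumptions for illustration.

# Hypothetical reconstruction of the ID-iterating grabber discussed above.
# TORRENT_URL is an assumed pattern; only DETAILS_URL appears in the log.
import os
import requests

DETAILS_URL = "https://isohunt.com/torrent_details/{id}/"  # confirmed in the log
TORRENT_URL = "https://isohunt.com/download/{id}.torrent"  # assumed pattern

def grab(torrent_id, out_dir="torrents"):
    """Fetch one .torrent by numeric ID; on a hit, also fetch its details page."""
    resp = requests.get(TORRENT_URL.format(id=torrent_id), timeout=30)
    if resp.status_code == 404:
        return False  # a 404 from a static backend is cheap: no DB query
    resp.raise_for_status()
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "%d.torrent" % torrent_id), "wb") as f:
        f.write(resp.content)
    # only for hits do we pay for the slower, DB-backed details page
    details = requests.get(DETAILS_URL.format(id=torrent_id), timeout=30)
    with open(os.path.join(out_dir, "%d.html" % torrent_id), "wb") as f:
        f.write(details.content)
    return True

if __name__ == "__main__":
    for torrent_id in range(1, 1000):  # real ID range unknown; adjust as needed
        grab(torrent_id)

The check order mirrors joepie91's reasoning in the log: hit the (presumably static) .torrent backend first, where a miss costs the server almost nothing, and only fetch the dynamic details page for IDs that actually exist.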
[22:06] using Python threads
[22:06] lysobit: meh, might as well
[22:07] stick it on a dedi with a 1 Gbit pipe; done
[22:07] and a 1 TB HD
[22:07] though, easier said than done :P
[22:07] fwiw, even if the torrent is no longer in its original location, the details page has the info hash, which is all you really need
[22:08] metadata would be nice to have
[22:08] as the infohash alone doesn't tell you what files are in the torrent
[22:09] and it would be even better if you could store the name of the torrent too
[22:10] yeah, but there are other projects mass-downloading torrent files for infohashes, and every torrent site under the sun will have the torrent file as well
[22:11] so that stuff is not really at risk
[22:16] hmm, this multithreaded version actually works pretty well, it seems
[22:16] :P
[22:17] http://pastie.org/private/agybnuru8digavvhdagt1w
[22:19] not blocked yet
[22:19] running 5 threads
[22:19] roughly 10-15 torrents checked per second
[22:20] 400 days
[22:20] not fast enough
[22:23] hrm
[22:27] We know you love isoHunt, but you shouldn't hit us this fast. You are banned for 1200 seconds.
[22:27] :(
[22:30] well, at least we now know that they have rate limiting in place lol
[22:41] right, the script has throttling now...
[22:50] sooooooooo
[22:50] 5 threads was apparently also too much
[22:51] http://pastie.org/private/jlaqklfwjkhznbx4bdrnrw
[22:51] feel free to change the range to a subset of the current range (newest and oldest vars) and run
[22:52] :)
[22:52] (given the absence of a Warrior project)
[23:38] right
[23:38] 3 threads seems to be the max
[23:38] to not get throttled
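
The updated pastie is also dead. Below is a sketch of how the multithreaded, throttled variant might have been structured, using the "newest" and "oldest" range variables mentioned in the log. The 3-thread cap reflects the limit found by trial and error above (5 threads tripped the 1200-second ban); the delay value, ID bounds, and download URL pattern are assumptions.

# Sketch of the throttled multithreaded grabber discussed above.
# THREADS=3 comes from the log; DELAY, NEWEST, OLDEST and the URL are guesses.
import os
import threading
import time
from queue import Queue

import requests

TORRENT_URL = "https://isohunt.com/download/{id}.torrent"  # assumed pattern
NEWEST = 999999   # hypothetical newest torrent ID ("newest" var in the log)
OLDEST = 1        # hypothetical oldest torrent ID ("oldest" var in the log)
THREADS = 3       # 5 threads got banned; 3 seemed to be the max
DELAY = 0.3       # assumed global pause between request starts

os.makedirs("torrents", exist_ok=True)
id_queue = Queue()
pacer = threading.Lock()  # serializes the sleeps across all threads

def worker():
    while True:
        torrent_id = id_queue.get()
        try:
            with pacer:            # global throttle: at most one request
                time.sleep(DELAY)  # started per DELAY seconds
            resp = requests.get(TORRENT_URL.format(id=torrent_id), timeout=30)
            if resp.status_code != 404:
                resp.raise_for_status()
                with open("torrents/%d.torrent" % torrent_id, "wb") as f:
                    f.write(resp.content)
        except Exception as e:
            print("id %d failed: %s" % (torrent_id, e))
        finally:
            id_queue.task_done()

for _ in range(THREADS):
    threading.Thread(target=worker, daemon=True).start()

for torrent_id in range(NEWEST, OLDEST - 1, -1):  # newest first
    id_queue.put(torrent_id)
id_queue.join()

Pacing globally rather than per thread keeps the request rate fixed regardless of thread count; the extra threads only help by overlapping slow responses, which is one plausible way to stay under a rate limit that bans on requests per second.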