[02:16] If anyone's interested, I put together a script to recursively scan+crawl SWFs
[02:16] No WARC support, sadly, and the output kinda sucks atm
[02:17] But I'll toss it on Gist or w/e if anyone wants it
[02:20] do it
[02:21] ;
[02:24] https://gist.github.com/anonymous/5131513
[02:24] I'd like to somehow fork the output of wget and the script sanely
[02:29] hrm
[02:30] well really what you want to do is to take wget and add swfmill support :-)
[02:30] tef: What I'd love to do in the future is some sort of Python wrapper or w/e that would add support for SWF and basic JS
[02:30] Even if it's just pattern-matching openBrWindow, window.open, etc
[02:31] But even then, having an automated SWF crawler may be useful, even if you have to find . | xargs wget --warc-file later
[02:36] heh
[02:37] i do have a python crawler which should have feature parity in terms of url extraction from html kicking around in my github
[02:37] but it writes ugly warcs because it uses an old version of requests
[02:43] Wonder if it's sane to write a wget wrapper to add functionality
[02:43] Might be better to use what works to start :)
[03:00] hi, sorry for perhaps an odd request: I'm trying to get my hands on some case law, but it's turning out to be basically impossible without being on an academic network. Wondered if anybody could help me out? Really grateful if you could PM me. thanks!
[03:17] http://brianbailey.me/the-world-wide-web-is-moving-to-aol
[03:26] gnathan_: not sure if you've already checked the RECAP site? they have a large number of items from PACER
[03:27] dashcloud: ah, I had not, but it looks like PACER is US. I'm looking for UK documents
[03:27] thanks, though :D
[03:31] welcome
[03:32] good night!
[03:37] I wish git rm had a --fuckyou option
[03:37] to be used when you see huge directories that have no references to the main application modules at all
[04:19] A while back someone mentioned that the IA makes cdx files from warcs if they do not exist. If I upload the warc and cdx together, will I get the same cdx file I uploaded back later, or the IA-generated one?
[04:20] Should there be any differences between those two cdx files?
[04:21] depends. if you use index-cdx, yes, but it might be doing some different url canonicalization
[04:34] You SHOULD get the generated CDX
[04:34] But why do that?
[04:36] i always make cdx files with my warc.gz
[05:03] > x-archive-meta-language:eng
[05:03] > x-archive-meta-title:Computer Shopper (January 2002)
[05:03] > Content-Length: 10269704051
[05:04] Now that's a lotta content
[05:21] http://www.stuff.co.nz/technology/digital-living/8410469/Eftpos-prank-unnoticed-for-weeks
[05:35] uploaded: https://archive.org/details/www.g4tv.com-aots-blog-20130309
[05:54] uploaded: https://archive.org/details/www.g4tv.com-video-pages-index-20130308
[13:31] http://forums.lostlevels.org/viewtopic.php?p=32646
[13:33] Dog version of Virus (aka Dr. Mario) dumped
[13:53] Steal Princess Japanese Trailer: http://archive.org/details/g4tv.com-video36197
[13:54] I love seeing stuff like this get out onto the net
[13:54] prototypes of games are fun
[14:04] Do we have a command-line warc-proxy/warc-browsing solution?
[14:16] http://xrl.us/bonrhi ... well hello xrl.us, is that a URL in your pocket, or are you happy to see me?
[14:18] mr metamark
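
A minimal sketch of the kind of SWF URL scraping discussed above at 02:16-02:31 (this is not the code from the gist; the function name and the crude byte-level regex are illustrative assumptions, and a real tool would parse SWF tags properly, e.g. via swfmill):

    import re
    import sys
    import zlib

    # Printable-ASCII runs starting with http(s); crude but effective.
    URL_RE = re.compile(rb'https?://[\x21-\x7e]+')

    def swf_urls(path):
        """Crudely pull http(s) URLs out of one SWF file."""
        with open(path, 'rb') as f:
            data = f.read()
        # CWS-signature SWFs are zlib-compressed after the 8-byte header.
        if data[:3] == b'CWS':
            data = data[:8] + zlib.decompress(data[8:])
        # getURL()/loadMovie() targets, and JS calls like window.open or
        # openBrWindow, usually embed the URL as a plain string, so a
        # byte-level pattern catches most of them without a tag parser.
        return set(URL_RE.findall(data))

    if __name__ == '__main__':
        for url in sorted(swf_urls(sys.argv[1])):
            print(url.decode('ascii', 'replace'))

Newly found .swf URLs would then be fetched and scanned in turn to make the crawl recursive.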
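On the CDX question at 04:19-04:36: wget (1.14 and later) can write a CDX index itself alongside the WARC via --warc-cdx, which is one way to always have a cdx next to your warc.gz. A hedged sketch of driving it from Python (the helper name and the options beyond the two WARC flags are illustrative, not anyone's actual setup):

    import subprocess

    def fetch_to_warc(url, basename):
        """Fetch a URL into basename.warc.gz, asking wget (>= 1.14)
        to emit a CDX index next to it."""
        subprocess.run(
            ['wget',
             '--warc-file=' + basename,   # writes basename.warc.gz
             '--warc-cdx',                # writes a CDX index alongside
             '--page-requisites',         # illustrative extra option
             url],
            check=True)

    fetch_to_warc('http://example.com/', 'example')

Even so, as noted above, your own CDX and an IA-generated one can differ in URL canonicalization.
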
[14:46] holy shit, Skype knows how to sed
[14:48] I love that.
[14:48] I wish it supported actual regexps though
[14:50] o_O how?
[14:50] type something. Like "Hi, I'm a horse"
[14:50] nod.
[14:50] then s/horse/cake/
[14:51] bam, edited previous message
[14:51] on the next line?
[14:51] yeah, not working here :<
[14:51] did you send it?
[14:52] yup
[14:52] forgot the last slash.
[14:56] ah ;)
[14:56] so it worked?
[15:01] yup
[15:11] I am currently downloading 10 files per second across all the crawls I am running
[15:33] uploaded: http://archive.org/details/g4tv.com-hdvideo-xml-20130228
[15:52] Protip: expose your Solr database to the outside world http://search.comedycentral.com/solr/ http://search.comedycentral.com/solr/comedycentral/select?q=*:*
[16:00] google just updated their CSE admin interface
[16:00] it sucks really hard now
[16:07] Besides making me click more to get the same tasks done, the only thing they did was make it prettier
[16:17] CSE admin?
[16:17] the google admin interface to custom search engines
[16:17] ah
[16:17] it now looks like g+
[16:18] 0 surprise there
[16:18] also I would get 20 results at a time before
[16:18] now I only get 10 :(
[16:21] uploaded: http://archive.org/details/g4tv.com-missingvideos-20130226
[16:21] it's the wii service xml data dump
[16:29] I am going to step away from this problem for a few hours and see if I can get some new perspective on current scraping issues
[16:42] Anyone have an idea of what could be happening to this poor guy's computer? All I can think to suggest to them is "dying hard drive". https://gist.github.com/vaLinBSD/5090366
[16:44] "pathname contains \0" implies it's on hfs+ or some similar fs
[16:44] oh, um
[16:44] corrupt file system certainly, has he tried fscking
[16:44] yea
[16:45] once I had bad ram lead to all *kinds* of weird corruption
[16:45] yeah, was gonna say run memtest86 next
[16:46] I'll suggest that, thanks.
[16:52] so looks like my image dumps are in the 5gb+ area near the end of ids
[16:52] i hope it's only up to 317k or 318k ids
[17:25] hey shaqfu
[17:29] i'm uploading my g4tv.com/games/ dump
[17:57] uploaded: http://archive.org/details/www.g4tv.com-games-20130311
[18:50] you guys should see about mirroring cscope
[23:01] http://gawker.com/5978068/lost-unaired-episode-of-dexters-laboratory-finally-comes-out-of-hiding
[23:01] except now the video is private lol. but i'm sure plenty of people downloaded it
[23:17] http://scr.glados.me/1363043617.jpg Such good timing for a marketing email.
[23:17] (lower right screen, gmail notification)
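
The Skype trick at 14:46-15:01 is plain sed-style substitution: sending s/horse/cake/ rewrites the previous message. The same operation in Python, for reference:

    import re

    previous = "Hi, I'm a horse"
    # s/horse/cake/ applied to the previous message
    print(re.sub(r'horse', 'cake', previous))  # -> Hi, I'm a cake

(Per the complaint at 14:48, Skype only handles simple find-and-replace, not actual regexps.)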
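About the exposed Solr endpoint at 15:52: select?q=*:* is Solr's match-all query, so an open /solr/ lets anyone page through the entire index. A minimal sketch of what that looks like from Python, using requests (the host is the one from the log and has presumably been locked down since; rows=10 is an illustrative page size):

    import requests

    # q=*:* matches every document; wt=json asks for JSON instead of XML.
    resp = requests.get(
        'http://search.comedycentral.com/solr/comedycentral/select',
        params={'q': '*:*', 'wt': 'json', 'rows': 10})
    print(resp.json()['response']['numFound'], 'documents exposed')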