#archiveteam-bs 2013-03-11,Mon

Time Nickname Message
02:16 🔗 shaqfu If anyone's interested, I put together a script to recursively scan+crawl SWFs
02:16 🔗 shaqfu No WARC support, sadly, and the output kinda sucks atm
02:17 🔗 shaqfu But I'll toss it on Gist or w/e if anyone wants it
02:20 🔗 tef do it
02:21 🔗 tef ;
02:24 🔗 shaqfu https://gist.github.com/anonymous/5131513
02:24 🔗 shaqfu I'd like to somehow fork the output of wget and the script sanely
02:29 🔗 tef hrm
02:30 🔗 tef well really what you want to do is to take wget and add swfmill support :-)
02:30 🔗 shaqfu tef: What I'd love to do in the future is some sort of Python wrapper or w/e that would add support for SWF and basic JS
02:30 🔗 shaqfu Even if it's just pattern-matching openBrWindow, window.open, etc
02:31 🔗 shaqfu But even then, having an automated SWF crawler may be useful, even if you have to find . | xargs wget --warc later
02:36 🔗 tef heh
02:37 🔗 tef i do have a python crawler which should have feature parity in terms of url extraction from html kicking around in my github
02:37 🔗 tef but it writes ugly warcs because it uses an old version of requests
02:43 🔗 shaqfu Wonder if it's sane to write a wget wrapper to add functionality
02:43 🔗 shaqfu Might be better to use what works to start :)
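
[Editor's note: a minimal sketch of the "just pattern-matching openBrWindow, window.open" idea shaqfu mentions above. It is not taken from the linked gist or from tef's crawler. The regex, the function name `extract_popup_urls`, and the assumption that the URL is the first quoted argument are all illustrative; `openBrWindow` is assumed to refer to the Dreamweaver-style `MM_openBrWindow` helper. Calls that build the URL from variables or string concatenation would be missed.]

```python
import re

# Hypothetical pattern (an assumption, not from the gist): capture the first
# quoted string argument of window.open(...) or (MM_)openBrWindow(...).
POPUP_CALL = re.compile(
    r"""(?:window\.open|openBrWindow)\s*\(\s*(['"])(.*?)\1""",
    re.IGNORECASE,
)


def extract_popup_urls(source):
    """Return the string literals passed as first argument to popup calls."""
    return [match.group(2) for match in POPUP_CALL.finditer(source)]


if __name__ == "__main__":
    page = (
        "<a href=\"#\" onclick=\"MM_openBrWindow('popup.html','win','width=300')\">x</a>"
        "<script>window.open(\"http://example.com/a.swf\");</script>"
    )
    for url in extract_popup_urls(page):
        print(url)
```

[The extracted URLs could then be written to a file and handed to wget in a later pass, along the lines of the "crawl first, WARC later" approach discussed above.]
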
03:00 🔗 gnathan_ hi, sorry for perhaps an odd request: trying to get my hands on some case law but it's turning out to be basically impossible without being on an academic network. Wondered if anybody could help me out? really grateful if you could PM me. thanks!
03:17 🔗 closure http://brianbailey.me/the-world-wide-web-is-moving-to-aol
03:26 🔗 dashcloud gnathan_: not sure if you've already checked the recap site? they have a large number of items from PACER
03:27 🔗 gnathan_ dashcloud: ah, I had not, but it looks like PACER is US. I'm looking for UK document
03:27 🔗 gnathan_ s
03:27 🔗 gnathan_ thanks, though :D
03:31 🔗 dashcloud welcome
03:32 🔗 dashcloud good night!
03:37 🔗 yipdw I wish git rm had a --fuckyou option
03:37 🔗 yipdw to be used when you see huge directories that have no references to the main application modules at all
04:19 🔗 omf_ A while back someone mentioned that the IA makes cdx files from warcs if they do not exist. If I upload the warc and cdx together will I get the same cdx file I uploaded back later or the IA generated one?
04:20 🔗 omf_ Should there be any differences between those two cdx files?
04:21 🔗 tef depends. if you use index-cdx, yes, but it might be doing some different url canonicalization
04:34 🔗 SketchCo1 You SHOULD get the generated CDX
04:34 🔗 SketchCo1 But why do that
04:36 🔗 godane i always make cdx files with my warc.gz
05:03 🔗 SketchCow > x-archive-meta-language:eng
05:03 🔗 SketchCow > x-archive-meta-title:Computer Shopper (January 2002)
05:03 🔗 SketchCow > Content-Length: 10269704051
05:04 🔗 SketchCow Now that's a lotta content
05:21 🔗 S[h]O[r]T http://www.stuff.co.nz/technology/digital-living/8410469/Eftpos-prank-unnoticed-for-weeks
05:35 🔗 godane uploaded: https://archive.org/details/www.g4tv.com-aots-blog-20130309
05:54 🔗 godane uploaded: https://archive.org/details/www.g4tv.com-video-pages-index-20130308
13:31 🔗 mistym http://forums.lostlevels.org/viewtopic.php?p=32646
13:33 🔗 mistym Dog version of Virus (aka Dr. Mario) dumped
13:53 🔗 godane Steal Princess Japanese Trailer: http://archive.org/details/g4tv.com-video36197
13:54 🔗 omf_ I love seeing stuff like this get out onto the net
13:54 🔗 omf_ prototypes of games are fun
14:04 🔗 Smiley Do we have a commandline warc-proxy/warc-browsing solution?
14:16 🔗 closure http://xrl.us/bonrhi ... well hello xrl.us, is that an url in your pocket, or are you happy to see me?
14:18 🔗 ersi mr metamark
14:46 🔗 ersi holy shit, Skype knows how to sed
14:48 🔗 mistym I love that.
14:48 🔗 mistym I wish it supported actual regexps though
14:50 🔗 Smiley o_O how?
14:50 🔗 ersi type something. Like "Hi, I'm a horse"
14:50 🔗 Smiley nod.
14:50 🔗 ersi then s/horse/cake/
14:51 🔗 ersi bam, edited previous message
14:51 🔗 Smiley on the next line?
14:51 🔗 Smiley yahy not working here :<
14:51 🔗 ersi did you send it?
14:52 🔗 Smiley yup
14:52 🔗 Smiley forgot the last slash.
14:56 🔗 ersi ah ;)
14:56 🔗 ersi so it worked?
15:01 🔗 Smiley yup
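
[Editor's note: a rough sketch of the edit-by-substitution behaviour described above, where `s/horse/cake/` rewrites the previous message and a missing trailing slash makes the command fail. This is not Skype's implementation. The function name `apply_edit` is hypothetical, the match is a literal string (in line with mistym's remark that actual regexps are not supported), and replacing only the first occurrence is an assumption.]

```python
def apply_edit(previous, command):
    """Apply an s/old/new/ command to `previous`; return None if malformed."""
    parts = command.split("/")
    # A well-formed command splits into exactly ['s', old, new, ''];
    # without the trailing slash there are only three parts.
    if len(parts) != 4 or parts[0] != "s" or parts[3] != "" or parts[1] == "":
        return None
    _, old, new, _ = parts
    # Literal (non-regex) replacement of the first occurrence only (assumed).
    return previous.replace(old, new, 1)


if __name__ == "__main__":
    print(apply_edit("Hi, I'm a horse", "s/horse/cake/"))
    print(apply_edit("Hi, I'm a horse", "s/horse/cake"))
```
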
15:11 🔗 omf_ I am currently downloading 10 files per second across all the crawls I am running
15:33 🔗 godane uploaded: http://archive.org/details/g4tv.com-hdvideo-xml-20130228
15:52 🔗 mistym Protip: expose your SOLR database to the outside world http://search.comedycentral.com/solr/ http://search.comedycentral.com/solr/comedycentral/select?q=*:*
16:00 🔗 omf_ google just updated their CSE admin interface
16:00 🔗 omf_ it sucks really hard now
16:07 🔗 omf_ Besides making me click more to get the same tasks done, the only thing they did was make it prettier
16:17 🔗 chronomex CSE admin?
16:17 🔗 omf_ the google admin interface to custom search engines
16:17 🔗 chronomex ah
16:17 🔗 omf_ it now looks like g+
16:18 🔗 chronomex 0 surprise there
16:18 🔗 omf_ also I would get 20 results at a time before
16:18 🔗 omf_ now I only get 10 :(
16:21 🔗 godane uploaded: http://archive.org/details/g4tv.com-missingvideos-20130226
16:21 🔗 godane its the wii service xml data dump
16:29 🔗 omf_ I am going to step away from this problem for a few hours and see if I can get some new perspective on current scraping issues
16:42 🔗 mistym Anyone have an idea of what could be happening to this poor guy's computer? All I can think to suggest to them is "dying hard drive". https://gist.github.com/vaLinBSD/5090366
16:44 🔗 chronomex "pathname contains \0" implies it's on hfs+ or some similar fs
16:44 🔗 chronomex oh, um
16:44 🔗 DFJustin corrupt file system certainly, has he tried fscking
16:44 🔗 chronomex yea
16:45 🔗 chronomex once I had bad ram lead to all *kinds* of weird corruption
16:45 🔗 DFJustin yeah was gonna say run memtest86 next
16:46 🔗 mistym I'll suggest that, thanks.
16:52 🔗 godane so looks like my image dumps are in 5gb+ area near the end
16:52 🔗 godane of ids
16:52 🔗 godane i hope its only up to 317k or 318k ids
17:25 🔗 godane hey shaqfu
17:29 🔗 godane i'm uploading my g4tv.com/games/ dump
17:57 🔗 godane uploaded: http://archive.org/details/www.g4tv.com-games-20130311
18:50 🔗 godane you guys should see about mirroring cscope
23:01 🔗 S[h]O[r]T http://gawker.com/5978068/lost-unaired-episode-of-dexters-laboratory-finally-comes-out-of-hiding
23:01 🔗 S[h]O[r]T except now the video is private lol. but im sure plenty of people downloaded it
23:17 🔗 GLaDOS http://scr.glados.me/1363043617.jpg What such good timing for a marketing email.
23:17 🔗 GLaDOS (lower right screen, gmail notification)
