[00:15] heh
[00:15] running ffgrab.rb on AT&T's 3G network is an exercise in omg
[00:19] eek
[01:21] /home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:212:in `block in initialize'
[01:21] /home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:212:in `initialize'
[01:21] /home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:377:in `watchdog'
[01:21] /home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:66:in `new'
[01:21] yipdw: i'm getting a whole mess of these
[01:21] /home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:69:in `block (2 levels) in spawn'
[01:21] /home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:66:in `block in spawn'
[01:21] E, [2011-12-09T20:21:18.680575 #18455] ERROR -- : Exception Errno::ECONNREFUSED (Connection refused - Unable to connect to Redis on 127.0.0.1:6379) raised while scraping /cartoon/; requeuing.
[01:26] yeah
[01:26] I said you need a Redis instance
[01:31] i figured out how to run redis. now where is it saving these links to, or isn't it?
[01:32] all data is being saved in the Redis database
[01:32] the list of story IDs is present in the "stories" key
[01:32] all other keys are discovery state
[01:32] which is where?
[01:33] the stories key is in the Redis database
[01:33] redis-cli is the Redis CLI interface
[01:33] to query, launch it and run e.g. scard stories, smembers stories
[01:33] more info available at http://redis.io
[01:34] also, you should pull again, because I fixed a problem with usage of the If-Modified-Since header
[01:34] wow, 5k stories in 45 sec, that's fast
[01:34] it's not actually pulling story data
[01:35] but, yeah, it's decently quick
[01:35] it can be faster without the random waits, but
[01:35] i know, it's just saving valid IDs, right
[01:35] I feel bad about taking that out
[01:35] if you run it on JRuby 1.6+, it is possible to scale it to a very high number of threads
[01:36] do i need to run anything else before running ffgrab again after a pull
[01:36] but again, I don't feel good about doing that, because I don't know fanfiction.net's capabilities
[01:36] no, terminate it
[01:36] i did
[01:36] ok
[01:36] then just re-run it
[01:36] and i git pulled, and now i'm running again
[01:36] the scraper will pick up from where it left off
[01:37] anyway, brb
[01:37] well, bbl more like, heh
[01:37] thanks, bye
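
For reference, the scard/smembers queries above can also be run from Ruby through the redis gem. A minimal sketch, assuming the default local instance the scraper expects; the "stories" key name comes from the conversation, everything else is illustrative:

    require 'redis' # gem install redis

    # Redis.new connects to 127.0.0.1:6379 by default. If no server is
    # running there, you get the Errno::ECONNREFUSED seen in the log above.
    redis = Redis.new

    puts redis.scard('stories')               # number of story IDs found so far
    redis.smembers('stories').take(10).each do |id|
      puts id                                 # sample members of the set
    end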
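
The If-Modified-Since fix mentioned at 01:34 is about HTTP conditional requests: send the timestamp of your last fetch, and the server answers 304 Not Modified instead of the full body when nothing changed. A sketch of the idea in plain net/http, not ffgrab's actual code; the URL and timestamp are illustrative:

    require 'net/http'
    require 'time'

    uri = URI('http://www.fanfiction.net/cartoon/')
    last_fetch = Time.now - 3600 # illustrative: when this page was last crawled

    req = Net::HTTP::Get.new(uri.request_uri)
    req['If-Modified-Since'] = last_fetch.httpdate

    res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
    puts res.code # "304" means skip re-parsing, "200" means fresh content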
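
On the JRuby remark at 01:35: the stack trace at the top comes from girl_friday, an actor-based work queue, and its :size option sets the worker count. On JRuby those workers map onto real threads with no global interpreter lock, which is why the count can be pushed much higher than on MRI. A sketch of the pattern; the queue name, worker body, and fetch_page helper are illustrative, not ffgrab's actual code:

    require 'girl_friday'

    # Illustrative stand-in for the scraper's real page fetch.
    def fetch_page(path)
      puts "fetching #{path}"
    end

    queue = GirlFriday::WorkQueue.new(:page_fetcher, :size => 20) do |msg|
      sleep(1 + rand(3)) # the "random waits" mentioned above, to stay polite
      fetch_page(msg[:path])
    end

    queue.push(:path => '/cartoon/')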
[06:55] http://retropc.net/
[06:57] the japanese counterpart to SketchCow?
[07:01] man... why didn't I buy another HR-S9911U or two before they disappeared?
[07:38] the japanese guys are like negative sketchcow, they collect shitloads of stuff and then never digitize any of it ever
[07:40] there are a few exceptions
[08:47] not sure if there was a link from the main page of the site, so: http://retropc.net/alice/
[09:49] well this is awesome
[09:49] 9) "/Wrestling_and_CSI_Miami_Crossovers/230/1686/_cache_control"
[09:50] woooo
[09:50] the best
[09:50] O_O
[09:51] fan fiction gets weeeeeird
[09:51] 47) "/Frasier_and_Megami_Tensei_Crossovers/381/1074/_cache_control"
[09:51] i wonder how many ff.net stories have self-insertion
[09:51] fan fiction starts weird, remains weird, and ends weird.
[09:51] is _cache_control some sort of strange fanfic thing?
[09:52] no, that's my Cache-Control observance mechanism
[09:52] :P
[09:52] it's kind of hacky
[09:52] but it works
[09:54] argh, what is the linux tool to compare two text files where you can specify e.g. "only show entries that appear in a but not b"? not diff, something simpler
[09:55] grep -v -f b a
[09:56] works best when B is small and A is large
[09:56] it was something with the flags -1, -2, -3
[09:56] but that sounds good too :)
[09:56] it's not quite there, but that's where I would start.
[09:57] you also want to tell grep to match whole lines only (dunno the option), and interpret the lines as fixed strings rather than patterns (-F, I think)
[10:00] chronomex: comm it is :)
[10:00] ah
[10:08] ah, this alice soft archive is nice
[13:36] http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance-win/
[21:05] Schbirid: I bet it would have really flown with fgrep.
[21:58] gsc-game.com wget-warc appears to be done, 1.5gb total, including cdx and warc file
[21:58] 44,185 items, at 1.8gb
[23:14] Yes, once again, archiveteam's site has spam on it.
[23:18] You know, for being all twitchy about the fact I've been porting their digitized magazines to archive.org, this site has done a spectacularly shitty job getting consistent spams.
[23:18] scans.
[23:19] Some of these are literally mish-mashes where issues 1, 2, 3, 8, and 9 are collections of JPG files, then 4, 5, 6 are PDFs, and then 7 is a set of JPG files in two file directories.
[23:47] SketchCow: hah
[23:58] scanning is boring, let's get high first
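
On the _cache_control keys from 09:49-09:52: one plausible shape for that "Cache-Control observance mechanism", sketched in Ruby. This is a guess at how the hack might work, not the actual code; only the key naming mirrors what was printed above:

    require 'redis'

    redis = Redis.new

    # Hypothetical: when a response carries Cache-Control: max-age=N, set a
    # marker key that Redis expires after N seconds. While the marker lives,
    # the page counts as fresh and is not re-fetched.
    def mark_fresh(redis, path, cache_control)
      return unless cache_control =~ /max-age=(\d+)/
      redis.setex("#{path}/_cache_control", $1.to_i, 1)
    end

    def fresh?(redis, path)
      redis.exists("#{path}/_cache_control")
    end

    mark_fresh(redis, '/Frasier_and_Megami_Tensei_Crossovers/381/1074', 'max-age=300')
    puts fresh?(redis, '/Frasier_and_Megami_Tensei_Crossovers/381/1074')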
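
And on the 09:54 question: comm -23 a b prints lines only in a (given sorted input; the -1/-2/-3 flags each suppress one output column), while the grep route wants -x for whole-line matches plus -F for fixed strings, i.e. grep -vxF -f b a. The same set difference in Ruby, since that's the channel's language of the day; file names are illustrative:

    # lines that appear in a.txt but not in b.txt, compared as whole lines
    a = File.readlines('a.txt')
    b = File.readlines('b.txt')
    puts(a - b)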