[00:25] hey
[00:25] hi
[01:14] well this is interesting, i tried to run bundle install for the ffgrab ruby project and got a password prompt, put in my pass, it grabbed and installed the proper gems, but i still get this: Your bundle is complete! Use `bundle show [gemname]` to see where a bundled gem is installed. ben@ben-laptop:~/1432483$ ruby ffgrab.rb ffgrab.rb:1:in `require': no such file to load -- connection_pool (LoadError) from ffgrab.rb:1
[02:48] bsmith093: check which Ruby it is
[02:49] also, I've got a list of IDs
[02:49] as in complete crawl
[02:50] holy crap already?
[02:50] also now it's spitting back a bunch of "already claimed" or "fetching it again"
[02:50] http://depot.ninjawedding.org/fanfiction.net_story_ids_2011-12-12.gz
[02:50] do what you will with those
[02:51] so yeah i was using 1.8.7 for some reason; use 1.9.3 and it works fine
[02:51] not really any reason to run the crawler at this point
[02:51] at least not for another month or so
[02:52] k then, how do i sort the numbers?
[02:52] it's easier to split them up that way
[02:53] yipdw: which script did you use for that?
[02:54] arrith: my own
[02:54] https://gist.github.com/1432483
[02:54] it's hacked together, but works
[02:54] threaded crawler
[02:54] yipdw: when you first start grabbing stories, re-run the ID checker, starting from the highest number you got last time
[02:55] it doesn't work that way
[02:55] it's not really possible to start from a number with that script
[02:55] that thing starts at a set of known roots and traces from there
[02:55] this is plenty, then
[02:55] using Last-Modified headers and cache-control keys to avoid re-fetching
[02:55] however I learned that ff.net's Last-Modified headers are not exactly useful wrt determining new stories
[02:55] so
[02:55] I don't recommend re-running this particular script
[02:56] if you want to re-sync, write one that makes use of the current ID knowledge
[02:56] it may be useful, however, to periodically re-run that grabber
[02:56] I don't think I missed anything, but if someone discovers a bug in the fetch code that leads to link loss
[02:56] it'd make sense to re-run it
[02:57] anyway, to sort
[02:57] zcat [that file] | sort -n
[02:59] if you'd like, you can also sync a Redis DB to what I have
[02:59] you will need approximately 500 MB free
[02:59] (of RAM)
[02:59] the crawler keeps around a lot of state, a bit like wget
[02:59] how would i do that (sync)?
[03:00] you'd need to establish an SSH tunnel to one of my computers and then run the slaveof command on your Redis instance
[03:00] but I'm not particularly comfortable with allowing people to SSH to my personal systems
[03:00] so I'll need to set up maybe an EC2 instance for that or some such
[03:01] and i suck at tunnelling
[03:01] anyway this is great, thanks :D
[03:01] man ssh, check out the -N and -L options
[03:01] I can also load up the RDB dump, I guess, but that may have compat issues
[03:02] does this handle chapter issues as well?
[03:02] it only grabs story IDs
[03:02] my assumption is that the story grabber will handle those
[03:02] chapters, that is
[03:03] also, re: the Redis sharing thing, I guess I could also use Redis To Go
[03:03] i still like my idea of grabbing the double-arrow link and iterating backwards through the chapters (or forwards) to grab the html pages directly and then work on those
[03:03] ergh
[03:04] why ergh?
[03:04] that's a possibility, but another one is just clicking Next until you don't have another page
[03:04] how would i automate that?
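One way to automate the "keep clicking Next" idea in Ruby, using net/http and nokogiri as in the snippet pasted later in this log. This is only a sketch: it assumes chapter pages live at http://www.fanfiction.net/s/<story id>/<chapter>/ and that a request past the last chapter comes back without the chapter combobox; neither assumption is confirmed anywhere in this log.

    # Sketch: walk a story's chapters by bumping the chapter number in the URL
    # until the page stops looking like a chapter page.
    require 'net/http'
    require 'nokogiri'
    require 'uri'

    story_id = ARGV[0]
    chapter  = 1

    loop do
      html = Net::HTTP.get(URI.parse("http://www.fanfiction.net/s/#{story_id}/#{chapter}/"))
      doc  = Nokogiri::HTML(html)
      # Single-chapter stories never have the combobox, so always keep chapter 1;
      # otherwise a missing combobox is taken to mean "past the last chapter".
      break if chapter > 1 && (doc / 'select[name="chapter"]').empty?
      File.open("#{story_id}_#{chapter}.html", "w") { |f| f.write(html) }
      chapter += 1
    end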
[03:05] the ergh was about redis to go
[03:05] Redis To Go works
[03:05] I personally would not entrust it with any mission-critical data
[03:05] but I also would not entrust Redis with mission-critical data :P
[03:05] well, business-critical
[03:05] and I mean I wouldn't make Redis the source of truth for that
[03:06] the prices are a little high for a quick transfer of data
[03:06] yeah
[03:06] as far as automation goes, I dunno -- maybe wget has something for that built-in; I haven't memorized its options
[03:07] neither have i
[03:07] or you can use another program to find all chapter stops (read the options in the combobox perhaps)
[03:07] and then feed that list of URLs into wget
[03:07] that's what the splinder grabber does
[03:07] afaik that's all javascript
[03:08] no, it isn't
[03:08] go to a story that has chapters and look at the chapter combobox
[03:08] i checked the source, it looked java-y to me
[03:08] that is page markup
[03:08] oy
[03:12] can redis push into another redis server?
[03:12] yeah, that's the basis of redis replication
[03:12] as opposed to pull from
[03:12] oh
[03:12] no, not as far as I know
[03:13] well apparently this is what i would grep the html for » the number after the id i'm currently on
[03:14] also whatever the hell the code for that funky arrow symbol is
[03:17] doing intelligent recrawls is something i wonder if there's already 'best practices' for
[03:18] i suppose that might be like a search on a graph or something
[03:18] almost certainly
[03:18] there are; a lot of it involves asking the server what changed and respecting its caching parameters
[03:18] that's about all you can do
[03:19] searching on a graph is what my crawler does, FYI
[03:19] anyway, this is what I mean by looking at those combobox entries
[03:20] throw this into a file or some such and run it with the first argument as a story ID:
[03:20] require 'net/http'
[03:20] require 'nokogiri'
[03:20] (Nokogiri.HTML(Net::HTTP.get(URI.parse("http://www.fanfiction.net/s/#{ARGV[0]}")))/'select[name="chapter"]')[0].children.map(&:text).tap { |x| puts x.join("\n") }
[03:20] on 5909536, for example, it'll print out the names of all 9 chapters
[03:21] and that helps, how?
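For readability, here is the 03:20 one-liner unpacked into a short script that does the same thing, under the same assumption that the chapter combobox is a select element named "chapter":

    # The 03:20 one-liner, spelled out. Prints one chapter title per line for
    # the story ID passed as the first argument.
    require 'net/http'
    require 'nokogiri'
    require 'uri'

    html   = Net::HTTP.get(URI.parse("http://www.fanfiction.net/s/#{ARGV[0]}"))
    doc    = Nokogiri::HTML(html)
    select = (doc / 'select[name="chapter"]').first        # the chapter combobox
    puts select.children.map(&:text).join("\n") if select

    # To emit a URL list for wget instead, and assuming the /s/<id>/<chapter>/
    # pattern holds (not confirmed in this log), something like:
    #   1.upto(select.children.size) { |n| puts "http://www.fanfiction.net/s/#{ARGV[0]}/#{n}/" }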
[03:21] getting the chapter numbers from there involves replacing map(&:text) with map { |c| c.attr('value').text } or something along those lines
[03:21] the brain
[03:21] use it
[03:22] oh ok, i just thought for like 5 sec, and yeah, that makes link generation much easier ;P
[03:22] alternatively, since chapters always go from 1 to a highest point, you can just take the number of entries in that list and count from 1 up
[03:26] well afaik searching graphs means things like A* and fancy stuff i don't know yet
[03:30] nah, doesn't have to
[03:31] consider stories and categories as vertices and (category, category) and (category, story) tuples as arcs
[03:31] if you start at a known set of sources and go in a breadth-first manner (like my crawler)
[03:31] that's breadth-first search
[03:32] A* search is about minimizing the cost of a path between two vertices
[03:32] so you deal with a graph that has weighted edges
[03:33] I can't think of a beneficial use of weighting in this particular scenario, though, so right now I don't think that A* would offer any gains
[03:33] could be wrong though
[03:36] where you tend to find A* search, alpha-beta pruning, minimax, etc is in code that needs to make decisions
[03:37] AI game-playing agents is one example
[03:37] crawlers, I guess, could use it, if they had some knowledge of network topology ("hit up these first because they're closer", etc)
[03:37] but I don't know how much of a gain that'd make vs. the added complexity
[03:37] ask the authors of Heritrix
[05:10] ah Heritrix looks interesting
[05:16] whee
[05:16] redis' make test fails
[05:16] Coderjoe: lie
[05:16] *lies
[05:16] "::redis::redis_read_reply $fd"
[05:16] (procedure "::redis::__dispatch__" line 23)
[05:16] Bad protocol, as reply type byte
[05:16] [exception]: Executing test client: Bad protocol, as reply type byte.
[05:16] while executing
[05:16] invoked from within
[05:17] sounds like a client issue
[05:17] what are you using?
[05:17] redis's "make test"
[05:17] ah gotcha
[05:18] straight from the repo?
[05:18] from the download page
[05:18] the 2.4.4 tarball from the download page
[05:19] testing..
[05:19] all good here :)
[05:20] that was rather fast
[05:20] speedy machine :)
[05:20] yeah... my load average jumps up to 14 during some of these tests
[05:20] all good here
[05:21] Coderjoe: https://gist.github.com/5e482330834dba9594a8
[05:21] ugh, they should remove color if it's not in a tty
[05:21] also
[05:21] $ time make test
[05:21] real 0m18.342s
[05:21] user 0m0.040s
[05:22] make test 12.12s user 3.01s system 88% cpu 17.148 total
[05:22] foiled
[05:22] my fans definitely kicked in
[05:22] * Coderjoe deletes the directory and starts fresh
[05:22] Coderjoe: what system are you on?
[05:23] (os)
[05:23] and the thing is building a hell of a lot faster this time (for the main compilation)
[05:23] ubuntu 11.10
[05:33] and now running tests again
[05:50] and it puked again, on the same test I think
[05:50] (file "tests/integration/replication-3.tcl" line 1)
[06:56] Coderjoe: weird
[07:23] http://techcrunch.com/2011/12/12/thoora-shuts-down/
[07:24] is the whole internet shutting down?
[07:32] seems like it's just the parts of the Internet that nobody really needs
[07:57] Just 3 months after its public launch following 2 years of private beta,
[07:57] WELL THERE'S YER PROBLEM
[07:57] Just 3 months after its public launch following 2 years of private beta, content discovery engine Thoora today announced that it will shut down.
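Circling back to the graph discussion from around 03:31 ("start at a known set of sources and go in a breadth-first manner"): a minimal Ruby sketch of what that traversal looks like. It is only an illustration of the idea, not the crawler from the gist; the block you pass in stands in for whatever code fetches a page and pulls out its category and story links.

    require 'set'

    # Breadth-first crawl sketch for the category/story graph described around
    # 03:31: categories and stories are vertices, (category, category) and
    # (category, story) pairs are arcs. The block receives a URL and should
    # return [[link, :category_or_:story], ...].
    def bfs_crawl(roots)
      queue   = roots.dup                      # start from the known root categories
      visited = Set.new(roots)
      stories = Set.new

      until queue.empty?
        url = queue.shift                      # FIFO queue is what makes this breadth-first
        yield(url).each do |link, type|
          next unless visited.add?(link)       # Set#add? returns nil if we've already seen it
          type == :story ? stories.add(link) : queue.push(link)
        end
      end

      stories                                  # the set of discovered story links/IDs
    end

    # Usage sketch: bfs_crawl(root_category_urls) { |url| extract_links_somehow(url) }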
[07:57] reposting that whole sentence
[08:06] I don't know what I did, but starting with a fresh instance seems to have corrected it
[08:07] all tests passed
[09:10] http://batcave.textfiles.com/ocrcount
[09:10] ha ha, take that archive.org
[09:10] \o/
[09:11] oh my god this is a little disgusting
[09:11] http://www.steve.org.uk/Software/redisfs/
[09:11] What does that do?
[09:12] it's a fuse filesystem to store all the data in a redis database
[09:18] Well, I just bought myself the pain of adding indexes to 514 magazine issues.
[09:18] However, ALL the lists are RIGHT there, all ready to go, it's 100% copy-paste.
[09:18] I should find some excellent This American Life episodes to listen to
[09:19] http://www.archive.org/details/73-magazine-1971-01
[09:23] SketchCow, for those Jamendo items, isn't there any way to put the license in the licenseurl field, when the whole album has the same one? (Which is probably what happens in most cases.)
[09:25] So that they can be searched by license etc.
[09:30] It's NOT always the same.
[09:30] I do agree it would be nice to be able to inject it.
[09:30] But we're now 4,000 albums in
[09:32] I found it easier to link it into the consistent description of the txt file.
[09:34] I'm booting up the This American Life app!
[09:36] Coderjoe: heh
[10:29] 100 issues finished!
[10:30] O_o
[10:30] /10_Things_I_Hate_About_You_and_10_Things_I_Hate_About_You_Crossovers/906/5226/
[10:47] crossover with what
[10:48] movie to tv show
[10:48] lol
[10:51] ^--> http://www.imdb.com/title/tt1302019/
[10:52] ugh
[11:01] oh dear lord
[11:02] also http://www.imdb.com/title/tt1386703/
[11:11] diverging even farther from "We Can Remember It For You Wholesale"
[11:11] geh. why is it 6am?
[11:31] how big did Jamendo turn out to be?
[11:32] like 2.5 years.
[11:32] saw that
[11:32] disk space?
[11:32] when can I put it on my phone?
[11:38] it's probably about 1-3 TB depending on the quality
[11:38] so give it a couple of years before phones have that much space.
[11:38] 2.5 years @ 160kbps = 1.5TB, for reference.
[11:41] yea, won't be long
[11:47] I don't think there are any programs geared towards exploring a collection of this size
[18:34] I am just putting up the mp3s.
[18:37] Actually, all of the Jamendo collection is 1.8 TB
[18:38] mine is bigger, ha
[18:39] All Jamendo albums from 0000-7999 are up now.
[18:39] I'm going to let archive.org settle down for a day and then jam more in.
[18:41] is there any chance that you leave the filenames intact?
[18:41] No.
[18:41] They are absolutely incompatible with the archive.org infrastructure.
[18:42] That is what took me so long, striking a devil's bargain between how to get them in there and have information stay intact.
[18:42] Now, the ID3 tags of the downloads are ALL intact, so any utility will restore them.
[18:42] I don't know why archive.org won't show them, but they're there.
[18:44] how are they incompatible? looking at random albums it seems like they just need to be lowercased and stripped of funny characters (and space -> _ )
[18:44] uh?
[18:44] Having to modify the natural state of anything to make it function in a new environment is the fundamental definition of incompatibility?
[18:45] not even lowercased actually
[18:45] http://ia700200.us.archive.org/10/items/1bit_007/ eg
[18:52] archive.org seems to show whatever is in the files.xml, not the ID3 title
[18:52] <SketchCow> I agree but there's something deeper.
[18:52] <SketchCow> So I'm just doing what it does, which works out OK.
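On the 18:42 remark that the ID3 tags are intact and "any utility will restore them": a rough sketch of what such a utility could look like in Ruby, assuming the ruby-mp3info gem (gem install ruby-mp3info). The log does not say which tool, if any, was actually used, and the "Artist - Title.mp3" naming scheme below is an arbitrary example.

    # Sketch: rebuild filenames from ID3 tags, per the 18:42 remark.
    require 'mp3info'

    dir = ARGV[0] || '.'
    Dir.glob(File.join(dir, '*.mp3')).each do |path|
      artist = title = nil
      Mp3Info.open(path) { |mp3| artist, title = mp3.tag.artist, mp3.tag.title }
      next if artist.to_s.empty? || title.to_s.empty?

      safe = "#{artist} - #{title}".gsub('/', '_')   # '/' can't appear in a filename
      File.rename(path, File.join(dir, "#{safe}.mp3"))
    end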
[18:52] <SketchCow> I'm currently adding those magazine descriptions for 73 magazine
[23:12] <pberry> Is this already known? https://twitter.com/#!/waxpancake/status/146728917112332289
[23:12] <pberry> "Horrible. TwapperKeeper was bought by Hootsuite and is nuking all their Twitter archives with a month's notice: http://twapperkeeper.com/"
[23:22] <SketchCow> Yes
[23:22] <SketchCow> Waxy's behind.
[23:24] <SketchCow> http://twitter.com/#!/textfiles/status/145672352640942080
[23:26] <SketchCow> He doesn't follow me anymore, otherwise he'd have seen it.
[23:29] <SketchCow> Mostly, I'm trying to endrun this by forcing Library of Congress to release their twitter archives to archive.org.
[23:50] <pberry> ah, that would be cool
[23:56] <SketchCow> I've now put in descriptions for 342 issues of 73 magazine.
[23:57] <pberry> nice