#archiveteam 2011-12-13,Tue


Time Nickname Message
00:25 ff_ hey
00:25 bbot_ hi
01:14 bsmith093 well this is interesting, i tried to run bundle install for the ffgrab ruby project and got a password prompt, put in my pass, it grabbed and installed the proper gems, but i still get this: Your bundle is complete! Use `bundle show [gemname]` to see where a bundled gem is installed. ben@ben-laptop:~/1432483$ ruby ffgrab.rb ffgrab.rb:1:in `require': no such file to load -- connection_pool (LoadError) from ffgrab.rb:1
02:48 yipdw bsmith093: check which Ruby it is
02:49 yipdw also, I've got a list of IDs
02:49 yipdw as in complete crawl
02:50 bsmith093 holy crap already?
02:50 bsmith093 also now it's spitting back a bunch of "already claimed" or "fetching it again"
02:50 yipdw http://depot.ninjawedding.org/fanfiction.net_story_ids_2011-12-12.gz
02:50 yipdw do what you will with those
02:51 bsmith093 so yeah, i was using 1.8.7 for some reason; use 1.9.3 and it works fine
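
That LoadError usually means the script was run with a different Ruby than the one Bundler installed the gems for. A minimal sketch of the fix, assuming RVM manages the interpreters (the key part is bundle exec, which runs the script against the bundled gem set):

    rvm use 1.9.3               # assumes RVM; pick the same Ruby that ran bundle install
    bundle install
    bundle exec ruby ffgrab.rb  # loads connection_pool from the bundle
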
02:51 yipdw not really any reason to run the crawler at this point
02:51 yipdw at least not for another month or so
02:52 bsmith093 k then, how do i sort the numbers?
02:52 bsmith093 it's easier to split them up that way
02:53 arrith yipdw: which script did you use for that?
02:54 yipdw arrith: my own
02:54 yipdw https://gist.github.com/1432483
02:54 yipdw it's hacked together, but works
02:54 yipdw threaded crawler
02:54 bsmith093 yipdw: when you first start grabbing stories, re-run the id checker, starting from the highest number you got last time
02:55 Coderjoe it doesn't work that way
02:55 yipdw it's not really possible to start from a number with that script
02:55 yipdw that thing starts at a set of known roots and traces from there
02:55 bsmith093 this is plenty, then
02:55 yipdw using Last-Modified headers and cache-control keys to avoid re-fetching
02:55 yipdw however I learned that ff.net's Last-Modified headers are not exactly useful wrt determining new stories
02:55 yipdw so
02:55 yipdw I don't recommend re-running this particular script
02:56 yipdw if you want to re-sync, write one that makes use of the current ID knowledge
02:56 yipdw it may be useful, however, to periodically re-run that grabber
02:56 yipdw I don't think I missed anything, but if someone discovers a bug in the fetch code that leads to link loss
02:56 yipdw it'd make sense to re-run it
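
The usual building block for a re-sync script of the sort described above is a conditional GET: send If-Modified-Since and skip any page that answers 304 Not Modified (though, as yipdw notes, ff.net's Last-Modified headers are unreliable for spotting new stories). A minimal sketch, not yipdw's code, using plain Net::HTTP:

    require 'net/http'
    require 'time'

    # Fetch url only if it changed since last_fetch; returns nil on 304 Not Modified.
    def fetch_if_modified(url, last_fetch)
      uri = URI.parse(url)
      req = Net::HTTP::Get.new(uri.request_uri)
      req['If-Modified-Since'] = last_fetch.httpdate
      res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
      res.is_a?(Net::HTTPNotModified) ? nil : res.body
    end
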
02:57 yipdw anyway, to sort
02:57 yipdw zcat [that file] | sort -n
02:59 yipdw if you'd like, you can also sync a Redis DB to what I have
02:59 yipdw you will need approximately 500 MB free
02:59 yipdw (of RAM)
02:59 yipdw the crawler keeps around a lot of state, a bit like wget
02:59 bsmith093 how would i do that (sync)?
03:00 yipdw you'd need to establish an SSH tunnel to one of my computers and then run the slaveof command on your Redis instance
03:00 yipdw but I'm not particularly comfortable with allowing people to SSH to my personal systems
03:00 yipdw so I'll need to set up maybe an EC2 instance for that or some such
03:01 bsmith093 and i suck at tunnelling
03:01 bsmith093 anyway this is great, thanks :D
03:01 yipdw man ssh, check out the -N and -L options
03:01 yipdw I can also load up the RDB dump, I guess, but that may have compat issues
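
The tunnel-plus-slaveof setup would look roughly like this (user and remote-host are hypothetical; your local Redis on 6379 then pulls from the remote master through the forwarded port):

    ssh -N -L 6380:localhost:6379 user@remote-host &   # forward local port 6380 to the remote Redis
    redis-cli slaveof 127.0.0.1 6380                   # tell the local instance to replicate from it
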
03:02 bsmith093 does this handle chapter issues as well?
03:02 yipdw it only grabs story IDs
03:02 yipdw my assumption is that the story grabber will handle those
03:02 yipdw chapters, that is
03:03 yipdw also, re: the Redis sharing thing, I guess I could also use Redis To Go
03:03 bsmith093 i still like my idea of grabbing the double arrow link and iterating backwards through the chapters (or forwards) to grab the html pages directly and then work on those
03:03 Coderjoe ergh
03:04 bsmith093 why ergh?
03:04 yipdw that's a possibility, but another one is just clicking Next until you don't have another page
03:04 bsmith093 how would i automate that?
03:05 Coderjoe the ergh was about redis to go
03:05 yipdw Redis To Go works
03:05 yipdw I personally would not entrust it with any mission-critical data
03:05 yipdw but I also would not entrust Redis with mission-critical data :P
03:05 yipdw well, business-critical
03:05 yipdw and I mean I wouldn't make Redis the source of truth for that
03:06 Coderjoe the prices are a little high for a quick transfer of data
03:06 yipdw yeah
03:06 yipdw as far as automation goes, I dunno -- maybe wget has something for that built-in; I haven't memorized its options
03:07 bsmith093 neither have i
03:07 yipdw or you can use another program to find all chapter stops (read the options in the combobox perhaps)
03:07 yipdw and then feed that list of URLs into wget
03:07 yipdw that's what the splinder grabber does
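
The wget side of that is just the -i flag, which reads a list of URLs from a file (chapter_urls.txt is a hypothetical name):

    wget -w 1 -i chapter_urls.txt   # -w 1 waits a second between requests
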
03:07 bsmith093 afaik that's all javascript
03:08 yipdw no, it isn't
03:08 yipdw go to a story that has chapters and look at the chapter combobox
03:08 bsmith093 i checked the source, it looked java-y to me
03:08 yipdw that is page markup
03:08 bsmith093 oy
03:12 Coderjoe can redis push into another redis server?
03:12 yipdw yeah, that's the basis of redis replication
03:12 Coderjoe as opposed to pull from
03:12 yipdw oh
03:12 yipdw no, not as far as I know
03:13 bsmith093 well apparently this is what i would grep the html for: <a href="/s/4457761/10/Basic_Instinct">»</a>, the number after the id i'm currently on
03:14 bsmith093 also whatever the hell the code for that funky arrow symbol is
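
That grep could look something like the following (story.html is a hypothetical saved page; depending on how the page is encoded, the arrow may appear as a literal » or as the &raquo; entity):

    grep -oE '<a href="/s/4457761/[0-9]+/[^"]*">(»|&raquo;)</a>' story.html
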
03:17 arrith doing intelligent recrawls is something i wonder if there's already 'best practices' for
03:18 arrith i suppose that might be like a search on a graph or something
03:18 bsmith093 almost certainly
03:18 yipdw there are; a lot of it involves asking the server what changed and respecting its caching parameters
03:18 yipdw that's about all you can do
03:19 yipdw searching on a graph is what my crawler does, FYI
03:19 yipdw anyway, this is what I mean by looking at those combobox entries
03:20 yipdw throw this into a file or some such and run it with the first argument as a story ID:
03:20 yipdw require 'net/http'
03:20 yipdw require 'nokogiri'
03:20 yipdw (Nokogiri.HTML(Net::HTTP.get(URI.parse("http://www.fanfiction.net/s/#{ARGV[0]}")))/'select[name="chapter"]')[0].children.map(&:text).tap { |x| puts x.join("\n") }
03:20 yipdw on 5909536, for example, it'll print out the names of all 9 chapters
03:21 bsmith093 and that helps, how?
03:21 yipdw getting the chapter numbers from there involves replacing map(&:text) with map { |c| c.attr('value').text } or something along those lines
03:21 yipdw the brain
03:21 yipdw use it
03:22 bsmith093 oh ok, i just thought for like 5 sec, and yeah, that makes link generation much easier ;P
03:22 yipdw alternatively, since chapters always go from 1 to a highest point, you can just take the number of entries in that list and count from 1 up
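
Putting those two observations together, a sketch building on yipdw's snippet (and assuming chapter pages live at /s/ID/N) that prints one URL per chapter:

    require 'net/http'
    require 'nokogiri'

    story_id = ARGV[0]
    html = Net::HTTP.get(URI.parse("http://www.fanfiction.net/s/#{story_id}"))
    select = Nokogiri::HTML(html).at_css('select[name="chapter"]')
    count = select ? select.css('option').size : 1  # single-chapter stories have no combobox
    (1..count).each { |n| puts "http://www.fanfiction.net/s/#{story_id}/#{n}" }

Run it the same way, with the story ID as the first argument, and the output can be fed straight to wget -i -.
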
03:26 arrith well afaik searching graphs means things like A* and fancy stuff i don't know yet
03:30 yipdw nah, doesn't have to
03:31 yipdw consider stories and categories as vertices and (category, category) and (category, story) tuples as arcs
03:31 yipdw if you start at a known set of sources and go in a breadth-first manner (like my crawler)
03:31 yipdw that's breadth-first search
03:32 yipdw A* search is about minimizing the cost of a path between two vertices
03:32 yipdw so you deal with a graph that has weighted edges
03:33 yipdw I can't think of a beneficial use of weighting in this particular scenario, though, so right now I don't think that A* would offer any gains
03:33 yipdw could be wrong though
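
For reference, breadth-first search over that category/story graph is just a queue plus a visited set; this sketch (not the actual crawler) assumes a hypothetical fetch_links(page) that returns the outgoing category and story links of a page:

    require 'set'

    def bfs(roots)
      seen  = Set.new(roots)
      queue = roots.dup
      until queue.empty?
        page = queue.shift                  # FIFO order is what makes this breadth-first
        fetch_links(page).each do |link|    # hypothetical fetch-and-parse step
          queue << link if seen.add?(link)  # Set#add? returns nil if link was already seen
        end
      end
      seen
    end
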
03:36 yipdw where you tend to find A* search, alpha-beta pruning, minimax, etc is in code that needs to make decisions
03:37 yipdw AI game-playing agents is one example
03:37 yipdw crawlers, I guess, could use it, if they had some knowledge of network topology ("hit up these first because they're closer", etc)
03:37 yipdw but I don't know how much of a gain that'd make vs. the added complexity
03:37 yipdw ask the authors of Heritrix
05:10 arrith ah Heritrix looks interesting
05:16 Coderjoe whee
05:16 Coderjoe redis' make test fails
05:16 kennethre Coderjoe: lie
05:16 kennethre *lies
05:16 Coderjoe "::redis::redis_read_reply $fd"
05:16 Coderjoe (procedure "::redis::__dispatch__" line 23)
05:16 Coderjoe Bad protocol, as reply type byte
05:16 Coderjoe [exception]: Executing test client: Bad protocol, as reply type byte.
05:16 Coderjoe while executing
05:16 Coderjoe invoked from within
05:17 kennethre sounds like a client issue
05:17 kennethre what are you using?
05:17 Coderjoe redis's "make test"
05:17 kennethre ah gotcha
05:18 kennethre straight from the repo?
05:18 Coderjoe from the download page
05:18 Coderjoe the 2.4.4 tarball from the download page
05:19 kennethre testing..
05:19 kennethre all good here :)
05:20 Coderjoe that was rather fast
05:20 kennethre speedy machine :)
05:20 Coderjoe yeah... my load average jumps up to 14 during some of these tests
05:20 yipdw all good here
05:21 kennethre Coderjoe: https://gist.github.com/5e482330834dba9594a8
05:21 kennethre ugh, they should remove color if it's not in a tty
05:21 yipdw also
05:21 yipdw $ time make test
05:21 yipdw real 0m18.342s
05:21 yipdw user 0m0.040s
05:22 kennethre make test 12.12s user 3.01s system 88% cpu 17.148 total
05:22 yipdw foiled
05:22 kennethre my fans definitely kicked in
05:22 * Coderjoe deletes the directory and starts fresh
05:22 kennethre Coderjoe: what system are you on?
05:23 kennethre (os)
05:23 Coderjoe and the thing is building a hell of a lot faster this time (for the main compilation)
05:23 Coderjoe ubuntu 11.10
05:33 Coderjoe and now running tests again
05:50 Coderjoe and it puked again, on the same test I think
05:50 Coderjoe (file "tests/integration/replication-3.tcl" line 1)
06:56 yipdw Coderjoe: weird
07:23 lemonkey http://techcrunch.com/2011/12/12/thoora-shuts-down/
07:24 Coderjoe is the whole internet shutting down?
07:32 yipdw seems like it's just the parts of the Internet that nobody really needs
07:57 RedType Just 3 months after its public launch following 2 years of private beta,
07:57 RedType WELL THERE'S YER PROBLEM
07:57 RedType Just 3 months after its public launch following 2 years of private beta, content discovery engine Thoora today announced that it will shut down.
07:57 RedType reposting that whole sentence
08:06 Coderjoe I don't know what I did, but starting with a fresh instance seems to have corrected it
08:07 Coderjoe all tests passed
09:10 SketchCow http://batcave.textfiles.com/ocrcount
09:10 SketchCow ha ha, take that archive.org
09:10 SketchCow \o/
09:11 Coderjoe oh my god this is a little disgusting
09:11 Coderjoe http://www.steve.org.uk/Software/redisfs/
09:11 SketchCow What does that do?
09:12 Coderjoe it's a fuse filesystem to store all the data in a redis database
09:18 SketchCow Well, I just bought myself the pain of adding indexes to 514 magazine issues.
09:18 SketchCow However, ALL the lists are RIGHT there, all ready to go, it's 100% copy-paste.
09:18 SketchCow I should find some excellent This American Life episodes to listen to
09:19 SketchCow http://www.archive.org/details/73-magazine-1971-01
09:23 Nemo_bis SketchCow, for those Jamendo items, isn't there any way to put the license in the licenseurl field, when the whole album has the same one? (Which is probably what happens in most cases.)
09:25 Nemo_bis So that they can be searched by license etc.
09:30 SketchCow It's NOT always the same.
09:30 SketchCow I do agree it would be nice to be able to inject it.
09:30 SketchCow But we're now 4,000 albums in
09:32 SketchCow I found it easier to link it into the consistent description of the txt file.
09:34 SketchCow I'm booting up the This American Life app!
09:36 db48x Coderjoe: heh
10:29 SketchCow 100 issues finished!
10:30 Coderjoe O_o
10:30 Coderjoe /10_Things_I_Hate_About_You_and_10_Things_I_Hate_About_You_Crossovers/906/5226/
10:47 dnova crossover with what
10:48 Coderjoe movie to tv show
10:48 dnova lol
10:51 Soojin ^--> http://www.imdb.com/title/tt1302019/
10:52 dnova ugh
11:01 Coderjoe oh dear lord
11:02 Coderjoe also http://www.imdb.com/title/tt1386703/
11:11 Coderjoe diverging even further from "We Can Remember It For You Wholesale"
11:11 Coderjoe geh. why is it 6am?
11:31 db48x how big did Jamendo turn out to be?
11:32 dnova like 2.5 years.
11:32 db48x saw that
11:32 db48x disk space?
11:32 db48x when can I put it on my phone?
11:38 dnova it's probably about 1-3tb depending on the quality
11:38 dnova so give it a couple of years before phones have that much space.
11:38 dnova 2.5 years @ 160kbps = 1.5TB, for reference.
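
That figure checks out: 160 kbps is 20 kB/s, and 2.5 years is roughly 7.9 x 10^7 seconds, so 20,000 B/s x 7.9 x 10^7 s is about 1.6 x 10^12 bytes, i.e. roughly 1.5 TB.
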
11:41 db48x yea, won't be long
11:47 db48x I don't think there are any programs geared towards exploring a collection of this size
18:34 SketchCow I am just putting up the mp3s.
18:37 SketchCow Actually, all of the Jamendo collection is 1.8tb
18:38 Schbirid mine is bigger, ha
18:39 SketchCow All Jamendo albums from 0000-7999 are up now.
18:39 SketchCow I'm going to let archive.org settle down for a day and then jam more in.
18:41 Schbirid is there any chance that you leave the filenames intact?
18:41 SketchCow No.
18:41 SketchCow They are absolutely incompatible with the archive.org infrastructure.
18:42 SketchCow That is what took me so long, striking a devil's bargain between how to get them in there and have information stay intact.
18:42 SketchCow Now, the ID3 tags of the downloads are ALL intact, so any utility will restore them.
18:42 SketchCow I don't know why archive.org won't show them, but they're there.
18:44 Schbirid how are they incompatible? looking at random albums it seems like they just need to be lowercased and stripped of funny characters (and space -> _ )
18:44 SketchCow uh?
18:44 SketchCow Having to modify the natural state of anything to make it function in a new environment is the fundamental definition of incompatibility?
18:45 Schbirid not even lowercased actually
18:45 Schbirid http://ia700200.us.archive.org/10/items/1bit_007/ eg
18:52 DFJustin archive.org seems to show whatever the <title> is in the files.xml, not the ID3 title
18:52 SketchCow I agree but there's something deeper.
18:52 SketchCow So I'm just doing what it does, which works out OK.
18:52 SketchCow I'm currently adding those magazine descriptions for 73 Magazine
23:12 pberry Is this already known? https://twitter.com/#!/waxpancake/status/146728917112332289
23:12 pberry "Horrible. TwapperKeeper was bought by Hootsuite and is nuking all their Twitter archives with a month's notice: http://twapperkeeper.com/"
23:22 SketchCow Yes
23:22 SketchCow Waxy's behind.
23:24 SketchCow http://twitter.com/#!/textfiles/status/145672352640942080
23:26 SketchCow He doesn't follow me anymore, otherwise he'd have seen it.
23:29 SketchCow Mostly, I'm trying to end-run this by forcing the Library of Congress to release their twitter archives to archive.org.
23:50 pberry ah, that would be cool
23:56 SketchCow I've now put in descriptions for 342 issues of 73 Magazine.
23:57 pberry nice
