Time | Nickname | Message
00:25 | ff_ | hey
00:25 | bbot_ | hi
01:14 | bsmith093 | well this is interesting: i tried to run bundle install for the ffgrab ruby project and got a password prompt, put in my pass, it grabbed and installed the proper gems, but i still get this: "Your bundle is complete! Use `bundle show [gemname]` to see where a bundled gem is installed." ben@ben-laptop:~/1432483$ ruby ffgrab.rb ffgrab.rb:1:in `require': no such file to load -- connection_pool (LoadError) from ffgrab.rb:1
02:48 | yipdw | bsmith093: check which Ruby it is
02:49 | yipdw | also, I've got a list of IDs
02:49 | yipdw | as in complete crawl
02:50 | bsmith093 | holy crap already?
02:50 | bsmith093 | also now it's spitting back a bunch of "already claimed" or "fetching it again"
02:50 | yipdw | http://depot.ninjawedding.org/fanfiction.net_story_ids_2011-12-12.gz
02:50 | yipdw | do what you will with those
02:51 | bsmith093 | so yeah, i was using 1.8.7 for some reason; using 1.9.3 it works fine
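That resolution matches the error above: the gems were installed under one interpreter and the script ran under another. A quick sanity check, assuming a stock shell setup, is:

    which ruby && ruby -v    # the interpreter the shell will actually run
    gem env gemdir           # where that interpreter's gems live

If the two still disagree, `bundle exec ruby ffgrab.rb` forces the bundled gem set.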
02:51 | yipdw | not really any reason to run the crawler at this point
02:51 | yipdw | at least not for another month or so
02:52 | bsmith093 | k then, how do i sort the numbers?
02:52 | bsmith093 | it's easier to split them up that way
02:53 | arrith | yipdw: which script did you use for that?
02:54 | yipdw | arrith: my own
02:54 | yipdw | https://gist.github.com/1432483
02:54 | yipdw | it's hacked together, but works
02:54 | yipdw | threaded crawler
02:54 | bsmith093 | yipdw: when you first start grabbing stories, re-run the id checker, starting from the highest number you got last time
02:55 | Coderjoe | it doesn't work that way
02:55 | yipdw | it's not really possible to start from a number with that script
02:55 | yipdw | that thing starts at a set of known roots and traces from there
02:55 | bsmith093 | this is plenty, then
02:55 | yipdw | using Last-Modified headers and cache-control keys to avoid re-fetching
02:55 | yipdw | however I learned that ff.net's Last-Modified headers are not exactly useful wrt determining new stories
02:55 | yipdw | so
02:55 | yipdw | I don't recommend re-running this particular script
02:56 | yipdw | if you want to re-sync, write one that makes use of the current ID knowledge
02:56 | yipdw | it may be useful, however, to periodically re-run that grabber
02:56 | yipdw | I don't think I missed anything, but if someone discovers a bug in the fetch code that leads to link loss
02:56 | yipdw | it'd make sense to re-run it
02:57 | yipdw | anyway, to sort
02:57 | yipdw | zcat [that file] | sort -n
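Since the goal above was also to split the IDs up, sort and split compose directly; a sketch with placeholder filenames:

    zcat fanfiction.net_story_ids_2011-12-12.gz | sort -n > ids_sorted.txt
    split -l 100000 ids_sorted.txt ids_chunk_    # ~100k IDs per output file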
02:59 | yipdw | if you'd like, you can also sync a Redis DB to what I have
02:59 | yipdw | you will need approximately 500 MB free
02:59 | yipdw | (of RAM)
02:59 | yipdw | the crawler keeps around a lot of state, a bit like wget
03:00 | bsmith093 | how would i do that? (sync)
03:00 | yipdw | you'd need to establish an SSH tunnel to one of my computers and then run the slaveof command on your Redis instance
03:00 | yipdw | but I'm not particularly comfortable with allowing people to SSH to my personal systems
03:01 | yipdw | so I'll need to set up maybe an EC2 instance for that or some such
03:01 | bsmith093 | and i suck at tunnelling
03:01 | bsmith093 | anyway this is great, thanks :D
03:01 | yipdw | man ssh, check out the -N and -L options
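A sketch of the tunnel-plus-replication recipe being described, assuming Redis on its default port 6379 at both ends (the host and user names are placeholders):

    ssh -N -L 6380:localhost:6379 user@remote.example.org &   # -N: run no command, -L: forward local 6380
    redis-cli slaveof 127.0.0.1 6380                          # local instance syncs the remote dataset
    redis-cli slaveof no one                                  # later: stop replicating, keep the data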
03:01
๐
|
yipdw |
I can also load up the RDB dump, I guess, but that may have compat issues |
03:02
๐
|
bsmith093 |
does this handle chappter issues as well? |
03:02
๐
|
yipdw |
it only grabs story IDs |
03:02
๐
|
yipdw |
my assumption is that the story grabber will handle those |
03:02
๐
|
yipdw |
chapters, that is |
03:03
๐
|
yipdw |
also, re: the Redis sharing thing, I guess I could also use Redis To Go |
03:03
๐
|
bsmith093 |
i still like my idea of grabbing the double arrow link and iterating backeards through the chapters ( or forwards) to grab the html pages directly and then work on those |
03:03
๐
|
Coderjoe |
ergh |
03:04
๐
|
bsmith093 |
why ergh? |
03:04
๐
|
yipdw |
that's a possibility, but another one is just clicking Next until you don't have another page |
03:04
๐
|
bsmith093 |
how would i automate that? |
03:05
๐
|
Coderjoe |
the ergh was about redis to go |
03:05
๐
|
yipdw |
Redis To Go works |
03:05
๐
|
yipdw |
I personally would not entrust it with any mission-critical data |
03:05
๐
|
yipdw |
but I also would not entrust Redis with mission-critical data :P |
03:05
๐
|
yipdw |
well, business-critical |
03:05
๐
|
yipdw |
and I mean I wouldn't make Redis the source of truth for that |
03:06
๐
|
Coderjoe |
the prices are a little high for a quick transfer of data |
03:06
๐
|
yipdw |
yeah |
03:06
๐
|
yipdw |
as far as automation goes, I dunno -- maybe wget has something for that built-in; I haven't memorized its options |
03:07
๐
|
bsmith093 |
neither have i |
03:07
๐
|
yipdw |
or you can use another program to find all chapter stops (read the options in the combobox perhaps) |
03:07
๐
|
yipdw |
and then feed that list of URLs into wget |
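wget can read a URL list from a file with -i, so the two steps compose directly; a sketch in which chapter_urls.rb is a hypothetical script along the lines of the one shown later in this log:

    ruby chapter_urls.rb 5909536 > urls.txt
    wget -i urls.txt --wait=1    # --wait adds a polite delay between fetches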
03:07 | yipdw | that's what the splinder grabber does
03:08 | bsmith093 | afaik that's all javascript
03:08 | yipdw | no, it isn't
03:08 | yipdw | go to a story that has chapters and look at the chapter combobox
03:08 | bsmith093 | i checked the source, it looked java-y to me
03:08 | yipdw | that is page markup
03:08 | bsmith093 | oy
03:12 | Coderjoe | can redis push into another redis server?
03:12 | yipdw | yeah, that's the basis of redis replication
03:12 | Coderjoe | as opposed to pull from
03:12 | yipdw | oh
03:13 | yipdw | no, not as far as I know
03:14 | bsmith093 | well apparently this is what i would grep the html for: <a href="/s/4457761/10/Basic_Instinct">»</a> — the number after the id i'm currently on
03:17 | bsmith093 | also whatever the hell the code for that funky arrow symbol is
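For reference, the funky arrow is », HTML entity &#187; (&raquo;). Matching the link shape rather than the arrow is probably sturdier; a sketch, with page.html standing in for a saved story page:

    grep -oE 'href="/s/[0-9]+/[0-9]+/[^"]*"' page.html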
03:17
๐
|
arrith |
doing intelligence recrawls is something i wonder if there's already 'best practices' for |
03:18
๐
|
arrith |
i suppose that might be like a search on a graph or something |
03:18
๐
|
bsmith093 |
almost certainly |
03:18
๐
|
yipdw |
there are; a lot of it involves asking the server what changed and respecting its caching parameters |
03:18
๐
|
yipdw |
that's about all you can do |
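Asking the server what changed usually means a conditional GET: replay the Last-Modified value you saw and skip on 304. A minimal Ruby sketch (the URL and date are placeholders, and as noted above ff.net's Last-Modified values were not reliable for this):

    require 'net/http'

    uri = URI.parse('http://www.fanfiction.net/s/5909536')
    last_seen = 'Mon, 12 Dec 2011 00:00:00 GMT'  # value saved from a previous fetch

    res = Net::HTTP.start(uri.host, uri.port) do |http|
      http.get(uri.request_uri, 'If-Modified-Since' => last_seen)
    end
    puts res.is_a?(Net::HTTPNotModified) ? 'unchanged, skip' : 'changed, re-fetch'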
03:19 | yipdw | searching on a graph is what my crawler does, FYI
03:20 | yipdw | anyway, this is what I mean by looking at those combobox entries
03:20 | yipdw | throw this into a file or some such and run it with the first argument as a story ID:
03:20 | yipdw | require 'net/http'
03:20 | yipdw | require 'nokogiri'
03:20 | yipdw | (Nokogiri.HTML(Net::HTTP.get(URI.parse("http://www.fanfiction.net/s/#{ARGV[0]}")))/'select[name="chapter"]')[0].children.map(&:text).tap { |x| puts x.join("\n") }
03:21 | yipdw | on 5909536, for example, it'll print out the names of all 9 chapters
03:21 | bsmith093 | and that helps, how?
03:21 | yipdw | getting the chapter numbers from there involves replacing map(&:text) with map { |c| c.attr('value').text } or something along those lines
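The same idea spelled out as a small script; a sketch assuming ff.net's markup of the time, where each option's value attribute held the chapter number (note that opt['value'] already returns a string, so no .text call is needed):

    require 'net/http'
    require 'nokogiri'

    doc = Nokogiri::HTML(Net::HTTP.get(URI.parse("http://www.fanfiction.net/s/#{ARGV[0]}")))
    select = doc.at('select[name="chapter"]')    # nil for single-chapter stories
    if select
      select.css('option').each { |opt| puts "#{opt['value']}: #{opt.text}" }
    else
      puts "1: (single chapter)"
    end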
03:21 | yipdw | the brain
03:22 | yipdw | use it
03:22 | bsmith093 | oh ok, i just thought for like 5 sec, and yeah, that makes link generation much easier ;P
03:26 | yipdw | alternatively, since chapters always go from 1 to a highest point, you can just take the number of entries in that list and count from 1 up
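Counting up makes link generation a one-liner; a sketch assuming the /s/<id>/<n>/ URL shape quoted earlier in the log (and assuming the trailing title slug is optional, as it appeared to be at the time):

    story_id, chapter_count = ARGV[0], ARGV[1].to_i   # e.g. 5909536 and 9
    (1..chapter_count).each do |n|
      puts "http://www.fanfiction.net/s/#{story_id}/#{n}/"
    end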
03:26
๐
|
arrith |
well afaik searching graphs means things like A* and fancy stuff i don't know yet |
03:30
๐
|
yipdw |
nah, doesn't have to |
03:31
๐
|
yipdw |
consider stories and categories as vertices and (category, category) and (category, story) tuples as arcs |
03:31
๐
|
yipdw |
if you start at a known set of sources and go in a breadth-first manner (like my crawler) |
03:31
๐
|
yipdw |
that's breadth-first search |
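That traversal is ordinary breadth-first search; a skeletal Ruby version in which fetch_links is a hypothetical function returning the category/story links found on a page:

    require 'set'

    def crawl(roots)
      seen  = Set.new(roots)
      queue = roots.dup
      until queue.empty?
        url = queue.shift                # FIFO queue = breadth-first order
        fetch_links(url).each do |link|
          next if seen.include?(link)    # visit each vertex once
          seen << link
          queue << link
        end
      end
      seen                               # every reachable category/story URL
    end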
03:32 | yipdw | A* search is about minimizing the cost of a path between two vertices
03:33 | yipdw | so you deal with a graph that has weighted edges
03:33 | yipdw | I can't think of a beneficial use of weighting in this particular scenario, though, so right now I don't think that A* would offer any gains
03:36 | yipdw | could be wrong though
03:37 | yipdw | where you tend to find A* search, alpha-beta pruning, minimax, etc. is in code that needs to make decisions
03:37 | yipdw | AI game-playing agents are one example
03:37 | yipdw | crawlers, I guess, could use it, if they had some knowledge of network topology ("hit up these first because they're closer", etc)
03:37 | yipdw | but I don't know how much of a gain that'd make vs. the added complexity
03:37 | yipdw | ask the authors of Heritrix
05:10 | arrith | ah Heritrix looks interesting
05:16 | Coderjoe | whee
05:16 | Coderjoe | redis' make test fails
05:16 | kennethre | Coderjoe: lie
05:16 | kennethre | *lies
05:16 | Coderjoe | "::redis::redis_read_reply $fd"
05:16 | Coderjoe | (procedure "::redis::__dispatch__" line 23)
05:16 | Coderjoe | Bad protocol, as reply type byte
05:16 | Coderjoe | [exception]: Executing test client: Bad protocol, as reply type byte.
05:16 | Coderjoe | while executing
05:17 | Coderjoe | invoked from within
05:17 | kennethre | sounds like a client issue
05:17 | kennethre | what are you using?
05:17 | Coderjoe | redis's "make test"
05:18 | kennethre | ah gotcha
05:18 | kennethre | straight from the repo?
05:18 | Coderjoe | from the download page
05:19 | Coderjoe | the 2.4.4 tarball from the download page
05:19 | kennethre | testing..
05:20 | kennethre | all good here :)
05:20 | Coderjoe | that was rather fast
05:20 | kennethre | speedy machine :)
05:20 | Coderjoe | yeah... my load average jumps up to 14 during some of these tests
05:21 | yipdw | all good here
05:21 | kennethre | Coderjoe: https://gist.github.com/5e482330834dba9594a8
05:21 | kennethre | ugh, they should remove color if it's not in a tty
05:21 | yipdw | also
05:21 | yipdw | $ time make test
05:21 | yipdw | real 0m18.342s
05:22 | yipdw | user 0m0.040s
05:22 | kennethre | make test 12.12s user 3.01s system 88% cpu 17.148 total
05:22 | yipdw | foiled
05:22 | kennethre | my fans definitely kicked in
05:22 | * | Coderjoe deletes the directory and starts fresh
05:23 | kennethre | Coderjoe: what system are you on?
05:23 | kennethre | (os)
05:23 | Coderjoe | and the thing is building a hell of a lot faster this time (for the main compilation)
05:33 | Coderjoe | ubuntu 11.10
05:50 | Coderjoe | and now running tests again
05:50 | Coderjoe | and it puked again, on the same test I think
06:56 | Coderjoe | (file "tests/integration/replication-3.tcl" line 1)
07:23 | yipdw | Coderjoe: weird
07:24 | lemonkey | http://techcrunch.com/2011/12/12/thoora-shuts-down/
07:24
๐
|
Coderjoe |
is the whole internet shutting down? |
07:32
๐
|
yipdw |
seems like it's just the parts of the Internet that nobody really needs |
07:57
๐
|
RedType |
Just 3 months after its public launch following 2 years of private beta, |
07:57
๐
|
RedType |
WELL THERE'S YER PROBLEM |
07:57
๐
|
RedType |
Just 3 months after its public launch following 2 years of private beta, content discovery engine Thoora today announced that it will shut down. |
07:57
๐
|
RedType |
reposting that whole sentence |
08:06
๐
|
Coderjoe |
I don't know what I did, but starting with a fresh instance seems to have corrected it |
08:07
๐
|
Coderjoe |
all tests passed |
09:10
๐
|
SketchCow |
http://batcave.textfiles.com/ocrcount |
09:10
๐
|
SketchCow |
ha ha, take that archive.org |
09:10
๐
|
SketchCow |
\o/ |
09:11
๐
|
Coderjoe |
oh my god this is a little disgusting |
09:11
๐
|
Coderjoe |
http://www.steve.org.uk/Software/redisfs/ |
09:11
๐
|
SketchCow |
What does that do? |
09:12
๐
|
Coderjoe |
it's a fuse filesystem to store all the data in a redis database |
09:18
๐
|
SketchCow |
Well, I just bought myself the pain of adding indexes to 514 magazine issues. |
09:18
๐
|
SketchCow |
However, ALL the lists are RIGHT there, all ready to go, it's 100% copy-paste. |
09:18
๐
|
SketchCow |
I should find some excellent This American life episodes to listen to |
09:19
๐
|
SketchCow |
http://www.archive.org/details/73-magazine-1971-01 |
09:23
๐
|
Nemo_bis |
SketchCow, for those Jamendo items, isn't there any way to put the license in the licenseurl field, when the whole album has the same one? (Which is probably what happens in most cases.) |
09:25
๐
|
Nemo_bis |
So that they can be searched by license etc. |
09:30
๐
|
SketchCow |
It's NOT always the same. |
09:30
๐
|
SketchCow |
I do agree it would be nice to be able to inject it. |
09:30
๐
|
SketchCow |
But we're now 4,000 albums in |
09:32
๐
|
SketchCow |
I found it easier to link it into the consistent description of the txt file. |
09:34
๐
|
SketchCow |
I'm booting up the this american life app! |
09:36
๐
|
db48x |
Coderjoe: heh |
10:29 | SketchCow | 100 issues finished!
10:30 | Coderjoe | O_o
10:30 | Coderjoe | /10_Things_I_Hate_About_You_and_10_Things_I_Hate_About_You_Crossovers/906/5226/
10:47 | dnova | crossover with what
10:48 | Coderjoe | movie to tv show
10:48 | dnova | lol
10:51 | Soojin | ^--> http://www.imdb.com/title/tt1302019/
10:52 | dnova | ugh
11:01 | Coderjoe | oh dear lord
11:02 | Coderjoe | also http://www.imdb.com/title/tt1386703/
11:11 | Coderjoe | diverging even farther from "We Can Remember It For You Wholesale"
11:11 | Coderjoe | geh. why is it 6am?
11:31 | db48x | how big did Jamendo turn out to be?
11:32 | dnova | like 2.5 years.
11:32 | db48x | saw that
11:32 | db48x | disk space?
11:32 | db48x | when can I put it on my phone?
11:38 | dnova | it's probably about 1-3tb depending on the quality
11:38 | dnova | so give it a couple of years before phones have that much space.
11:38 | dnova | 2.5 years @ 160kbps = 1.5TB, for reference.
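That estimate checks out: 160 kbit/s is 20 kB/s, 2.5 years is roughly 78.9 million seconds, and 78,900,000 s x 20,000 B/s is about 1.58 x 10^12 bytes, i.e. roughly 1.5 TB.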
11:41 | db48x | yea, won't be long
11:47 | db48x | I don't think there are any programs geared towards exploring a collection of this size
18:34 | SketchCow | I am just putting up the mp3s.
18:37 | SketchCow | Actually, all of the Jamendo collection is 1.8tb
18:38 | Schbirid | mine is bigger, ha
18:39 | SketchCow | All Jamendo albums from 0000-7999 are up now.
18:39 | SketchCow | I'm going to let archive.org settle down for a day and then jam more in.
18:41 | Schbirid | is there any chance that you leave the filenames intact?
18:41 | SketchCow | No.
18:41 | SketchCow | They are absolutely incompatible with the archive.org infrastructure.
18:42 | SketchCow | That is what took me so long, striking a devil's bargain between how to get them in there and have information stay intact.
18:42 | SketchCow | Now, the ID3 tags of the downloads are ALL intact, so any utility will restore them.
18:42 | SketchCow | I don't know why archive.org won't show them, but they're there.
18:44 | Schbirid | how are they incompatible? looking at random albums it seems like they just need to be lowercased and stripped of funny characters (and space -> _ )
18:44 | SketchCow | uh?
18:44 | SketchCow | Having to modify the natural state of anything to make it function in a new environment is the fundamental definition of incompatibility?
18:45 | Schbirid | not even lowercased actually
18:45 | Schbirid | http://ia700200.us.archive.org/10/items/1bit_007/ eg
18:52 | DFJustin | archive.org seems to show whatever the <title> is in the files.xml, not the ID3 title
18:52 | SketchCow | I agree but there's something deeper.
18:52 | SketchCow | So I'm just doing what it does, which works out OK.
18:52 | SketchCow | I'm currently adding those magazine descriptions for 73 magazine
23:12 | pberry | Is this already known? https://twitter.com/#!/waxpancake/status/146728917112332289
23:12 | pberry | "Horrible. TwapperKeeper was bought by Hootsuite and is nuking all their Twitter archives with a month's notice: http://twapperkeeper.com/"
23:22 | SketchCow | Yes
23:22 | SketchCow | Waxy's behind.
23:24 | SketchCow | http://twitter.com/#!/textfiles/status/145672352640942080
23:26 | SketchCow | He doesn't follow me anymore, otherwise he'd have seen it.
23:29 | SketchCow | Mostly, I'm trying to end-run this by forcing Library of Congress to release their twitter archives to archive.org.
23:50 | pberry | ah, that would be cool
23:56 | SketchCow | I've now put in descriptions for 342 issues of 73 magazine.
23:57 | pberry | nice