#archiveteam 2011-12-07,Wed


Time Nickname Message
00:39 ๐Ÿ”— pberry slow mobile me is still slow
01:54 ๐Ÿ”— SketchCow I can.t put my finger on the precise Reasons.
01:54 ๐Ÿ”— SketchCow I think the Audio-Quality needs work.
01:54 ๐Ÿ”— SketchCow I hear a bit of echo near the start.
01:54 ๐Ÿ”— SketchCow The Face (especially the eyes) is in some shots slightly out of Focus.
01:54 ๐Ÿ”— SketchCow The Moving shot is a bit of a risk with fixed lenses.
01:54 ๐Ÿ”— zetathust for mug shots
01:54 ๐Ÿ”— SketchCow The Focussing Bit near the End distracts a bit (I think Camera Lenses overshoot to fast when adjusting).
01:54 ๐Ÿ”— SketchCow Barely literate critics, have to love them.
02:10 ๐Ÿ”— underscor hahaha
02:19 ๐Ÿ”— rude___ Tell him that certain types of eyes simply absorb vast amounts of light into their cones, throwing the shot slightly out of focus.. it's rare, it's unavoidable, shit happens.
02:19 ๐Ÿ”— underscor lol
02:20 ๐Ÿ”— yipdw "camera lenses overshoot to fast when adjusting"
02:20 ๐Ÿ”— zetathust lenses
02:20 ๐Ÿ”— yipdw what
02:22 ๐Ÿ”— yipdw eesh, yikes
02:22 ๐Ÿ”— yipdw [ec2-user@ip-10-243-119-16 files.splinder.com]$ pwd; ls -1 | wc -l
02:22 ๐Ÿ”— yipdw 22046
02:34 ๐Ÿ”— * SketchCow boots zetathust, and does this: http://www.youtube.com/watch?v=Mu71EAdnjQ0
02:43 ๐Ÿ”— dashcloud if someone's got a better place for me to upload the 7z tell me, otherwise here's a link to it on mediafire: http://www.mediafire.com/?49kgs4umrb79a34
02:44 ๐Ÿ”— PatC If you get a dropbox you can upload it there and copy a public url
02:44 ๐Ÿ”— dashcloud I don't actually
02:45 ๐Ÿ”— dashcloud I don't really have anywhere else to put it online that I can share from, so my apologies there- but it is a pretty small download
02:58 ๐Ÿ”— SketchCow Why not just throw on batcave?
02:58 ๐Ÿ”— SketchCow I can make it browsable.
03:02 ๐Ÿ”— dashcloud I don't have any logins or access- if you want to PM me something, I can throw it up there right away
03:03 ๐Ÿ”— Coderjoe you don't have rsync?
03:05 ๐Ÿ”— dashcloud I do have rsync
03:10 ๐Ÿ”— dashcloud so how would I go about using rsync to get the folder onto batcave?
03:15 ๐Ÿ”— dashcloud okay- it's uploading to batcave
03:19 ๐Ÿ”— dashcloud okay- it's up there
03:23 ๐Ÿ”— dashcloud just as a note- there are some gaps in the archive, because some pages the site points to simply aren't there anymore
03:33 ๐Ÿ”— bsmith093 anything for the ffnet scrape
03:36 ๐Ÿ”— bsmith093 is it possible to upload directly into ia, using ftp
03:47 ๐Ÿ”— chronomex no, but you can use http
03:47 ๐Ÿ”— chronomex http://www.archive.org/help/abouts3.txt
03:47 ๐Ÿ”— yipdw oh, that's awesome
03:47 ๐Ÿ”— yipdw I didn't know IA had an S3 interface
03:48 ๐Ÿ”— yipdw that means I can reuse AWS::S3 and all the fun related bits
03:48 ๐Ÿ”— chronomex yeah, it's super rad.
03:49 ๐Ÿ”— chronomex that, and being able to specify metadata with http headers, means you can drop items into archive.org from shellscripts with 0 hassle
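A sketch of that header-driven upload, based on the abouts3.txt document linked above. The item name, filename, credentials, and metadata values are all placeholders; the command is echoed rather than executed, since a real run needs real IA S3 keys.

```shell
# Sketch of an ias3-style upload (per http://www.archive.org/help/abouts3.txt).
# ACCESSKEY/SECRET, the item name, and the metadata headers are placeholders.
item="example-warc-item"
file="site-20111207.warc.gz"
echo curl --location \
  --header "authorization: LOW ACCESSKEY:SECRET" \
  --header "x-amz-auto-make-bucket:1" \
  --header "x-archive-meta-mediatype:web" \
  --header "x-archive-meta-title:Example WARC upload" \
  --upload-file "$file" \
  "http://s3.us.archive.org/$item/$file"
```

The `x-archive-meta-*` headers become item metadata, which is what lets a shellscript create a fully described item in one request.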
03:49 ๐Ÿ”— yipdw I quite like that
03:51 ๐Ÿ”— DFJustin http://www.archive.org/create.php?ftp=1
03:53 ๐Ÿ”— chronomex oh, yeah, you can do that but it's kind of lousy.
03:58 ๐Ÿ”— bsmith093 yes but its easier to do that, than to use ftp to log in first and create the xml by hand
03:58 ๐Ÿ”— bsmith093 btw, as a library, IA kicks LoC in the nutsack
03:59 ๐Ÿ”— DFJustin loc's catalog is really nice
04:01 ๐Ÿ”— bsmith093 mostly because, when I search for anything in the Archive, i can take the url of the *search* and dump it into jdownloader, which will then proceed to load and look for links, and find *every single result on the page*, and give me human readable links for them so i can pick and choose, without even having to click on each individual result
04:02 ๐Ÿ”— bsmith093 seriously though, LoC website search is worse than useless, because it makes me give up, rather than keep looking, its just that bad
04:03 ๐Ÿ”— DFJustin yeah the interface sucks
04:03 ๐Ÿ”— DFJustin but at least the metadata is correct and somewhat consistent
04:03 ๐Ÿ”— underscor yipdw: Unless you need to create items with directories
04:03 ๐Ÿ”— underscor Then it sucks
04:03 ๐Ÿ”— underscor Although, it's a lot easier now that I have internal access
04:04 ๐Ÿ”— yipdw oh, I was thinking of using it to shove WARCs at the IA
04:05 ๐Ÿ”— underscor Then it's probably perfect
04:07 ๐Ÿ”— SketchCow TECHNICALLY it's not S3
04:07 ๐Ÿ”— SketchCow It's S3 like.
04:08 ๐Ÿ”— SketchCow Until this calms down: http://www.archive.org/~tracey/mrtg/derivesg.html
04:08 ๐Ÿ”— SketchCow I'll be focusing on other things.
04:10 ๐Ÿ”— underscor Oh wow
04:11 ๐Ÿ”— underscor It's mostly ximm with all his forever-running heritrix crawls
04:14 ๐Ÿ”— bsmith093 yipdw: but wouldn't you need to hand-create an xml for each warc file?
04:15 ๐Ÿ”— yipdw bsmith093: why would I need to hand-create it
04:16 ๐Ÿ”— underscor S3 automatically creates the necessary XML based off of the headers you pass in
05:13 ๐Ÿ”— SketchCow http://www.poe-news.com/forums/sp.php?pi=1002546492
05:13 ๐Ÿ”— SketchCow poe-news.com has announced they're shutting down.
05:14 ๐Ÿ”— bsmith093 start the warc
05:19 ๐Ÿ”— bsmith093 this good? wget-warc -mpke robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-cdx --warc-file=poe-news.com_12022011 www.poe-news.com
05:20 ๐Ÿ”— dnova bsmith093: can you give me a very succinct idea of the current state of ffnet project?
05:20 ๐Ÿ”— dnova or a very meandering, sloppy narrative
05:20 ๐Ÿ”— dnova that'll work too
05:20 ๐Ÿ”— bsmith093 have ideas, cant code, got something half-baked and done-ish
05:21 ๐Ÿ”— bsmith093 underscor's working on a script to grab reviews and stories with storyinator
05:21 ๐Ÿ”— bsmith093 im just iterating through every possible ffnet id, and culling the bad ones to make a linklist
05:28 ๐Ÿ”— bsmith093 underscor's way is almost certainly faster
05:32 ๐Ÿ”— arrith spidering the site like yipdw suggested might be the fastest
05:33 ๐Ÿ”— dnova arrith: can you explain #2 in the "extra credit"?
05:33 ๐Ÿ”— dnova http://learnpythonthehardway.org/book/ex8.html
05:35 ๐Ÿ”— arrith dnova: notice where double quotes get used versus where single quotes get used
05:35 ๐Ÿ”— arrith there's something unique about the double quoted sentence
05:35 ๐Ÿ”— dnova OH.
05:35 ๐Ÿ”— dnova the single quote wasn't escaped
05:36 ๐Ÿ”— arrith kinda
05:36 ๐Ÿ”— arrith just that there is a single quote
05:36 ๐Ÿ”— arrith usually when there's a single quote people use doubles
05:36 ๐Ÿ”— dnova hmph. well ok. thanks :)
05:36 ๐Ÿ”— arrith but yeah, you can escape it
05:37 ๐Ÿ”— arrith i dunno actually if people usually escape or not
05:37 ๐Ÿ”— arrith i've only seen doubles used then but i've only seen tutorialish code
05:56 ๐Ÿ”— bsmith093 spidering, i dont know how to tell wget to spider and save a linklist to then go back to
05:56 ๐Ÿ”— arrith not spider with wget
05:56 ๐Ÿ”— arrith spider with a ruby script that goes through the categories
05:56 ๐Ÿ”— bsmith093 also, on IA is it possible to edit an existing item?
05:56 ๐Ÿ”— arrith or python script
05:59 ๐Ÿ”— bsmith093 wheres the script, and how do i run it?
06:00 ๐Ÿ”— bsmith093 ive got something by underscor from a repo, that looks like ruby
06:01 ๐Ÿ”— arrith there isn't one
06:01 ๐Ÿ”— arrith you gotta make it
06:01 ๐Ÿ”— bsmith093 ugh
06:02 ๐Ÿ”— bsmith093 pardon me by yipdw git://gist.github.com/1432483.git
06:10 ๐Ÿ”— yipdw eh?
06:10 ๐Ÿ”— yipdw oh
06:10 ๐Ÿ”— bsmith093 yeah hows that going, any updates
06:10 ๐Ÿ”— yipdw yeah, I maintain that only hitting what you need to hit is the fastest way to do it
06:10 ๐Ÿ”— yipdw I haven't touched it since then
06:10 ๐Ÿ”— yipdw other work, etc.
06:11 ๐Ÿ”— yipdw I think arrith wanted to port it to Python
06:11 ๐Ÿ”— yipdw you can run it right now, if you have a Ruby 1.9 environment with the connection_pool, girl_friday and mechanize gems installed
06:12 ๐Ÿ”— bsmith093 ok , wonderful, now, how do i get those modules installed?
06:14 ๐Ÿ”— bsmith093 rubygems1.9.1 or 1.9
06:19 ๐Ÿ”— arrith haha
06:20 ๐Ÿ”— arrith yipdw: yeah i was basically waiting to see what underscor ends up with and go from there
06:20 ๐Ÿ”— bsmith093 seriously how do i get those ruby modules installed?
06:20 ๐Ÿ”— arrith possibly switching to a spidering method to get updates
06:20 ๐Ÿ”— arrith bsmith093: http://www.google.com/search?q=rubygems+ubuntu
06:31 ๐Ÿ”— dnova good god
06:31 ๐Ÿ”— dnova bsmith: how many stories are on ffnet?
06:31 ๐Ÿ”— dnova do we know?
06:33 ๐Ÿ”— dnova less than or equal to 10,000,000 it looks like?
06:35 ๐Ÿ”— dnova or: what is the highest valid ID you've found?
06:36 ๐Ÿ”— bsmith093 ~7million
06:36 ๐Ÿ”— bsmith093 can some kind person walk me through how to install the girl_friday gem, ive found the darn thing but it wont install with gem install
06:37 ๐Ÿ”— bsmith093 https://github.com/mperham/girl_friday.git
06:41 ๐Ÿ”— bsmith093 anyone?
06:43 ๐Ÿ”— dnova I have no ruby experience, sorry.
06:44 ๐Ÿ”— bsmith093 arrith
06:46 ๐Ÿ”— dnova bsmith,
06:46 ๐Ÿ”— dnova I think you need to relax just a little bit
06:48 ๐Ÿ”— dnova I added the project to the wiki frontpage
06:49 ๐Ÿ”— bsmith093 yeah , i know, im overtired and really need to sleep
06:49 ๐Ÿ”— chronomex dnova: spot on.
06:50 ๐Ÿ”— dnova ooh, thanks, chronomex
06:50 ๐Ÿ”— dnova any ideas/critiques are welcome
06:51 ๐Ÿ”— chronomex I meant with respect to relaxing, but the link looks good :)
06:51 ๐Ÿ”— dnova oh, lol
06:52 ๐Ÿ”— chronomex dang, it's been two months since I've uploaded anything
06:52 ๐Ÿ”— chronomex get busy time
06:53 ๐Ÿ”— dnova are you running the fix-dld script or what
06:53 ๐Ÿ”— dnova where are you getting all those splinder profiles!!
06:53 ๐Ÿ”— chronomex me?
06:53 ๐Ÿ”— chronomex I'm fix-dld
06:53 ๐Ÿ”— dnova ahh figured :D
06:53 ๐Ÿ”— chronomex was offline for a while.
06:53 ๐Ÿ”— dnova I'm downloading 2 users. have been for like 4 days
06:53 ๐Ÿ”— dnova one is over 12gb
06:53 ๐Ÿ”— dnova one is over 3
06:54 ๐Ÿ”— dnova I lost one that was over 10gb because I ran out of ram+swap :(
06:54 ๐Ÿ”— chronomex using tmpfs?
06:54 ๐Ÿ”— chronomex tmpfs is only a good idea for when you're doing a bunch of threads simultaneously
06:54 ๐Ÿ”— dnova not the way its supposed to be (i.e., not a ramdisk)
06:55 ๐Ÿ”— chronomex ?
06:55 ๐Ÿ”— chronomex no, the upload I'm doing now is to archive.org and not an archiveteam thing.
06:56 ๐Ÿ”— chronomex http://www.archive.org/details/bellsystem_PK-1C901-01
06:56 ๐Ÿ”— bsmith093 well, gnight/gmorning ,all, im gonna go sleep like i should have done 2hrs ago bye
06:56 ๐Ÿ”— dnova bsmith093: sleep well.
06:56 ๐Ÿ”— chronomex sleep well!
06:56 ๐Ÿ”— chronomex arrrgh
06:56 ๐Ÿ”— dnova :D
06:57 ๐Ÿ”— bsmith093 chronomex: ook now what?
06:57 ๐Ÿ”— chronomex bsmith093: ?
06:57 ๐Ÿ”— bsmith093 you said aargh
06:57 ๐Ÿ”— chronomex nvm
06:58 ๐Ÿ”— bsmith093 k night bye
06:59 ๐Ÿ”— dnova heh.
07:07 ๐Ÿ”— yipdw bsmith093: easiest way to install it is to get a Ruby environment, get Bundler (gem install bundler), and then install all the gems in the bundle (bundle install)
07:31 ๐Ÿ”— SketchCow Ops, please
16:42 ๐Ÿ”— DFJustin http://rbelmont.mameworld.info/?p=689
17:37 ๐Ÿ”— emijrp SketchCow: http://fromthepage.balboaparkonline.org/display/display_page?ol=w_rw_p_pl&page_id=1363#page/n0/mode/1up
19:03 ๐Ÿ”— SketchCow Nice
19:09 ๐Ÿ”— PepsiMax Aww yeah
19:09 ๐Ÿ”— PepsiMax Got my new VDSL2 hooked up.
19:09 ๐Ÿ”— PepsiMax 263.90kB/s uploading to alard
19:09 ๐Ÿ”— PepsiMax alard: more anyhub is coming!
20:57 ๐Ÿ”— bsmith093 and i just installed the gem connection_pool
20:57 ๐Ÿ”— bsmith093 ok i got rubygems to install finally, and they're all set up, except im still getting this error ffgrab.rb:1:in `require': no such file to load -- connection_pool (LoadError)
21:01 ๐Ÿ”— yipdw bsmith093: ruby -v
21:02 ๐Ÿ”— yipdw actually, just send me your full terminal log
21:02 ๐Ÿ”— bsmith093 ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
21:02 ๐Ÿ”— yipdw connection_pool does not work with Ruby 1.8.7, because it uses BasicObject, which only exists in Ruby 1.9
21:02 ๐Ÿ”— yipdw also, Ruby 1.9 automatically loads Rubygems; 1.8.7 doesn't
21:02 ๐Ÿ”— bsmith093 apt install ruby1.9
21:02 ๐Ÿ”— yipdw which is where the error you're seeing comes from
21:02 ๐Ÿ”— bsmith093 cause i think i did that
21:03 ๐Ÿ”— bsmith093 ruby1.9 is already the newest version.
21:03 ๐Ÿ”— yipdw ruby1.9 -v
21:03 ๐Ÿ”— bsmith093 ruby 1.9.0 (2008-10-04 revision 19669) [i486-linux]
21:04 ๐Ÿ”— yipdw ugh
21:04 ๐Ÿ”— yipdw that's...way behind
21:04 ๐Ÿ”— bsmith093 ah, another repo?
21:04 ๐Ÿ”— yipdw Ruby (and projects like it) move too fast for Debian/Ubuntu to keep up, IMO
21:04 ๐Ÿ”— bsmith093 oh wait yeah i just noticed the 2008 thing, wow, thats old
21:04 ๐Ÿ”— yipdw unless I can control the Ruby packages (e.g. for production environments) I use https://rvm.beginrescueend.com/
21:05 ๐Ÿ”— yipdw it bypasses package management, but for me, the benefit outweighs that cost
21:07 ๐Ÿ”— bsmith093 got rvm now, grabbing ruby 1.9.3
21:07 ๐Ÿ”— bsmith093 should i dump the ubuntu repo ruby?
21:07 ๐Ÿ”— yipdw only if you want to, it's not necessary
21:07 ๐Ÿ”— yipdw to dump it
21:08 ๐Ÿ”— bsmith093 k then, will this install it like a normal package?
21:08 ๐Ÿ”— yipdw RVM does not use apt, so no
21:08 ๐Ÿ”— ersi lol @ a language moving so fast you can't package it
21:09 ๐Ÿ”— yipdw it will, however, modify your environment's PATH to work out
21:09 ๐Ÿ”— yipdw ersi: it's not that uncommon
21:09 ๐Ÿ”— bsmith093 yeah ive never heard of that
21:09 ๐Ÿ”— ersi sounds more like a dialect, that forks all the time
21:09 ๐Ÿ”— yipdw I actually more often construct development environments directly from upstream than I do via OS packages
21:09 ๐Ÿ”— bsmith093 although i must say, this is the smoothest complex thing ive ever done
21:10 ๐Ÿ”— bsmith093 how do i keep it updated?
21:10 ๐Ÿ”— yipdw ersi: in particular, I've found that following upstream directly pays off for Node.js, factor, and GHC
21:10 ๐Ÿ”— yipdw bsmith093: rvm install [Ruby version]
21:11 ๐Ÿ”— bsmith093 so i have to know the version i have, or the version i want to get?
21:11 ๐Ÿ”— yipdw ersi: also, the syntax and semantics of Ruby don't change that often (although ruby-core has been doing some WTFs in that regard lately)
21:11 ๐Ÿ”— yipdw ersi: the libraries, on the other hand
21:11 ๐Ÿ”— yipdw bsmith093: yes; rvm list will show you those
21:11 ๐Ÿ”— bsmith093 oh, wow this is cool, ive also never had this much feedback from a compiler that i could actually follow
21:12 ๐Ÿ”— ersi I'm having a hard time understanding how a 10+year language can move so fast it's bleeding edge all the time
21:12 ๐Ÿ”— yipdw the language itself does not
21:12 ๐Ÿ”— yipdw implementations and libraries do
21:14 ๐Ÿ”— bsmith093 you know what would be nice? a dummy package for every linux distro, that does [language]-all, and grabs everything in the repos for that language
21:14 ๐Ÿ”— yipdw that would be infeasibly huge
21:14 ๐Ÿ”— bsmith093 how big could that possibly be?
21:14 ๐Ÿ”— yipdw for Ruby alone there's 31,503 libraries
21:15 ๐Ÿ”— yipdw Java would be an order of magnitude larger
21:15 ๐Ÿ”— bsmith093 mother of Turing, that's a lot of development
21:15 ๐Ÿ”— bsmith093 and to be fair, java mostly takes care of itself as it needs to
21:16 ๐Ÿ”— bsmith093 keep jvm updated and afaik thats all u need to worry about
21:16 ๐Ÿ”— yipdw Hackage lists, uh
21:17 ๐Ÿ”— yipdw something around 3633 packages for Haskell
21:17 ๐Ÿ”— bsmith093 ok, ok, so languages are much bigger that I thought, in their entirety
21:18 ๐Ÿ”— yipdw yeah -- I find that a language is really nothing without its libraries
21:18 ๐Ÿ”— yipdw I mean, sure, you can install an implementation of a language
21:18 ๐Ÿ”— yipdw but it's really pretty useless on its own
21:18 ๐Ÿ”— bsmith093 hey another thing, does a sudo operation keep root until its done, or is there a timer somewhere?
21:19 ๐Ÿ”— bsmith093 because ive had things crap out asking for rights halfway through
21:19 ๐Ÿ”— bsmith093 rubys's done
21:20 ๐Ÿ”— bsmith093 annnnd.. same error as last time only this time ruby -v ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
21:20 ๐Ÿ”— yipdw rvm use 1.9.3
21:21 ๐Ÿ”— bsmith093 /home/ben/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require': cannot load such file -- connection_pool (LoadError)
21:21 ๐Ÿ”— bsmith093 from /home/ben/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
21:21 ๐Ÿ”— bsmith093 from ffgrab.rb:1:in `<main>'
21:21 ๐Ÿ”— yipdw paste me the full terminal output
21:22 ๐Ÿ”— bsmith093 /home/ben/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require': cannot load such file -- connection_pool (LoadError)
21:22 ๐Ÿ”— bsmith093 ben@ben-laptop:~/1432483$
21:22 ๐Ÿ”— bsmith093 from /home/ben/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
21:22 ๐Ÿ”— bsmith093 from ffgrab.rb:1:in `<main>'
21:22 ๐Ÿ”— bsmith093 ruby ffgrab.rb
21:22 ๐Ÿ”— bsmith093 thats what i get
21:22 ๐Ÿ”— yipdw gem install bundler; bundle install
21:23 ๐Ÿ”— yipdw the Gemfile in the gist repo is a dependency manifest
21:23 ๐Ÿ”— yipdw for Bundler
21:24 ๐Ÿ”— bsmith093 i thought that was important, i kept trying ruby Gemfile on the offchance something would happen, this is not an intuitive lang to install
21:25 ๐Ÿ”— bsmith093 Fetching source index for http://rubygems.org/
21:25 ๐Ÿ”— bsmith093 now that seems like i would need that for gems, because thats where i found connection_pool and girl_friday
21:25 ๐Ÿ”— yipdw Rubygems is a packaging mechanism; bundler's a tool for managing packages
21:25 ๐Ÿ”— yipdw they're related, but Rubygems is independent of Bundler
21:26 ๐Ÿ”— bsmith093 well its finding the deps, and installing them, so whoo.
21:26 ๐Ÿ”— bsmith093 holy crap its running
21:27 ๐Ÿ”— yipdw I'd like to again point out that it doesn't do anything to record its results
21:27 ๐Ÿ”— bsmith093 and apparently, its timed itself to 6 decimal places?
21:27 ๐Ÿ”— yipdw times what
21:27 ๐Ÿ”— bsmith093 timestamp goes out to seconds.######
21:28 ๐Ÿ”— yipdw that's the default behavior of the Ruby logger library
21:28 ๐Ÿ”— yipdw but, yeah, there's no point in running that as-is for a long time
21:28 ๐Ÿ”— bsmith093 man, thats precise
21:28 ๐Ÿ”— yipdw because it doesn't yet actually do anything aside from spit results to the console
21:29 ๐Ÿ”— yipdw I'm not even sure if it handles pages correctly -- I *think* it does, but I haven't run it long enough to see how they get processed in the queue
21:30 ๐Ÿ”— bsmith093 i just had a thought, do user profiles show the stories all on one page, regardless of how many there are, cause that might be a help.
21:30 ๐Ÿ”— yipdw possibly, but AFAIK there is no way to get a list of all users
21:31 ๐Ÿ”— bsmith093 other than doing my original idea, and yours is much faster and uses less resources over all, on both ends
21:32 ๐Ÿ”— yipdw I can tell you that my method results in a lot of duplicates
21:32 ๐Ÿ”— yipdw in particular, it doesn't yet account for the "last page" link in each story
21:32 ๐Ÿ”— yipdw that will have to be filtered out in the discovery logic
21:33 ๐Ÿ”— bsmith093 yeah i dont really have any thoughts for that
21:33 ๐Ÿ”— yipdw it's just more HTML scraping
21:33 ๐Ÿ”— bsmith093 although the chapter is just a number appended to the link
21:33 ๐Ÿ”— yipdw not hard, just needs to be done
21:33 ๐Ÿ”— bsmith093 the next and back buttons are javascript, i think
21:34 ๐Ÿ”— yipdw https://gist.github.com/705cd333e06178057dec
21:34 ๐Ÿ”— yipdw that's a list of 4,506 story links recovered by ffgrab
21:34 ๐Ÿ”— yipdw well
21:34 ๐Ÿ”— yipdw 4506 / 2 roughly
21:34 ๐Ÿ”— bsmith093 wait the number before the title, thats the last chapter?
21:34 ๐Ÿ”— yipdw that's a chapter indicator
21:35 ๐Ÿ”— bsmith093 so let ffgrab run till its done then grep for dupes and keep the highest number
21:35 ๐Ÿ”— yipdw I'd rather fix it in the grabber
21:35 ๐Ÿ”— yipdw to ignore it, you'll have to change what stories_and_categories_of does at lines 12-13
21:36 ๐Ÿ”— yipdw I'm not sure what the change is, as I haven't looked at ff.net's page structure close enough to make the discernment
21:36 ๐Ÿ”— bsmith093 its still faster than iterating through 10mil semi-fake links
21:37 ๐Ÿ”— yipdw I am also suspicious of results like this:
21:37 ๐Ÿ”— yipdw I, [2011-12-07T15:33:31.381205 #75544] INFO -- : Found 0 categories, 0 stories from /book/My_Sweet_Audrina/
21:37 ๐Ÿ”— yipdw in that case, there really are no entries that show up
21:37 ๐Ÿ”— yipdw but any 0/0 results make me suspicious that the script is missing something
21:38 ๐Ÿ”— bsmith093 i was right, it is js here <input value="&nbsp;&lt; Prev&nbsp;" onclick="self.location='/s/7066342/6/The_Same_Will_Never_Happen_to_You'" type="BUTTON"> <select title="chapter navigation" name="chapter" onchange="self.location = '/s/7066342/'+ this.options[this.selectedIndex].value + '/The_Same_Will_Never_Happen_to_You';"><option value="1">1. Such a Shame</option><option value="2">2. Don't Do This</option><option value="3">3. The
21:39 ๐Ÿ”— yipdw that's not the link I was talking about
21:39 ๐Ÿ”— yipdw look at e.g. http://www.fanfiction.net/comic/300
21:39 ๐Ÿ”— yipdw see the » link?
21:39 ๐Ÿ”— yipdw that's the link to the last completed chapter
21:39 ๐Ÿ”— yipdw which is the link that the discovery code is picking up (and shouldn't pick up)
21:40 ๐Ÿ”— yipdw there's a few ways to fix that
21:40 ๐Ÿ”— bsmith093 hey I never noticed that before
21:41 ๐Ÿ”— yipdw anyway, I need to try to finish up some webapp work at work
21:41 ๐Ÿ”— yipdw which is a shitload of fuck related to the DOM and event propagation
21:41 ๐Ÿ”— yipdw as James Rolfe might put it
21:41 ๐Ÿ”— yipdw brb
21:41 ๐Ÿ”— bsmith093 wait so grab fanfiction.net/storyid/1 and that last chapter link, and generate all the rest of the links between them
21:41 ๐Ÿ”— bsmith093 yeah work comes first
21:42 ๐Ÿ”— bsmith093 lol nice reference
21:42 ๐Ÿ”— yipdw that's one possibility; another possibility is to just have wget-warc follow the links
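The enumeration idea floated above (take the story id, read the last chapter number from the last-chapter link, and generate every chapter URL in between) might look like this in shell. The helper name is invented; the `/s/<id>/<n>/<slug>` URL pattern follows the onclick handlers pasted earlier in the log.

```shell
# Hypothetical helper: print every chapter URL for a story, given its id,
# last chapter number, and title slug.
story_chapter_urls() {
  id="$1"; last="$2"; slug="$3"
  n=1
  while [ "$n" -le "$last" ]; do
    echo "http://www.fanfiction.net/s/$id/$n/$slug"
    n=$((n + 1))
  done
}

# Prints three URLs, chapters /1/ through /3/:
story_chapter_urls 7066342 3 The_Same_Will_Never_Happen_to_You
```

Feeding a list like this to wget-warc avoids both the duplicate last-chapter links and the 10-million-id brute force.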
21:42 ๐Ÿ”— bsmith093 thats what i tried, it only grabbed 300k files
21:43 ๐Ÿ”— bsmith093 wget-warc -mcpke robots off with ua for firefox
21:43 ๐Ÿ”— bsmith093 speaking of which, I'm still grabbing poe-news
21:59 ๐Ÿ”— emijrp IA is going to create a collection for Occupy movement http://blog.archive.org/2011/12/07/archive-it-team-encourages-your-contributions-to-the-%E2%80%9Coccupy-movement%E2%80%9D-collection/
21:59 ๐Ÿ”— emijrp but I think that there is no collection for Spanish Revolution or Arab Spring
22:06 ๐Ÿ”— emijrp I have many links to share if IA creates an Archive-It collection. I offered my help some weeks ago.
22:07 ๐Ÿ”— emijrp (I mean about Spanish Rev.)
22:20 ๐Ÿ”— DFJustin you can upload the stuff now and the collection can be made later
22:23 ๐Ÿ”— emijrp I prefer to use the Archive-It system. I don't want to upload a tarball with websites that can be viewed online.
22:24 ๐Ÿ”— emijrp Or 200 gb of videos (i have 6000+) because i cannot upload that with my home connection.
22:24 ๐Ÿ”— yipdw bsmith093: ok, so, I've got a variant of ffgrab recording story IDs in a Redis instance
22:25 ๐Ÿ”— emijrp I'm tired of content being uploaded to IA as huge boxes that cant be viewed easily.
22:27 ๐Ÿ”— bsmith093 emijrp: meaning what, exactly
22:27 ๐Ÿ”— bsmith093 huge iso files?
22:28 ๐Ÿ”— emijrp scrapes of forums, blogs hostings, geocities, wikis, yahoo videos
22:28 ๐Ÿ”— bsmith093 whats wrong with that?
22:29 ๐Ÿ”— DFJustin it's not great, but it's better to get the stuff backed up in some form first
22:29 ๐Ÿ”— emijrp that you cant use them easily
22:29 ๐Ÿ”— bsmith093 IA, afaik, isn't really meant as a mirror, its an archive, of the raw data, meant for historical research purposes
22:30 ๐Ÿ”— bsmith093 im sure there's a script for that somewhere
22:30 ๐Ÿ”— bsmith093 besides that, complain to them, not archiveteam.
22:31 ๐Ÿ”— emijrp IA has always offered content in a viewable way (wayback, videos and audio with metadata)
22:31 ๐Ÿ”— bsmith093 theres an entire section of ia dedicated to geocities
22:31 ๐Ÿ”— emijrp archiveteam is uploading dozen-GB tarballs with packed content
22:32 ๐Ÿ”— DFJustin it's just a manpower thing, right now it's all jason can do just to keep up with the tarballs coming in
22:33 ๐Ÿ”— bsmith093 i would imagine so, you cant grep through a tarball that i know of, and im sure he's doing the best he can, speaking of which, he's not the only person there doing what he's doing , is he?
22:33 ๐Ÿ”— bsmith093 SketchCow: voworkers?
22:33 ๐Ÿ”— DFJustin priority #1 has to be getting things off random people's hard drives and into IA's backup infrastructure so it doesn't just go poof
22:34 ๐Ÿ”— bsmith093 true and thats better than nothing by a long shot
22:34 ๐Ÿ”— bsmith093 but emijrp has a valid point, there needs to be a way to search through all this crapload of otherwise nearly-useless data
22:34 ๐Ÿ”— DFJustin for sure
22:35 ๐Ÿ”— dnova yeah. you download it, untar it, and look through it
22:35 ๐Ÿ”— bsmith093 seriously, is it possible to search through a remote tarball, cause that would be awesome
22:35 ๐Ÿ”— dnova most of these collections aren't meant for casual browsing, afaik.
22:35 ๐Ÿ”— yipdw bsmith093: curl http://[host][path][file].tar.gz | gunzip -c
22:35 ๐Ÿ”— bsmith093 and some, like the utzoo tapes are slightly damaged and should be repaired
22:36 ๐Ÿ”— bsmith093 yipdw: thats a remote search, as in it doesnt download all of it first
22:36 ๐Ÿ”— yipdw that doesn't download all of it
22:36 ๐Ÿ”— emijrp YES, I'm going to download the 600GB Geocities pack to watch a site. The good approach is geocities.ws or the mirrors people created.
22:36 ๐Ÿ”— yipdw it only goes until you terminate the site
22:36 ๐Ÿ”— yipdw er, connection
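The curl-pipe trick just described can be combined with tar to list or grep member names without ever storing the tarball. It's demonstrated here against a locally built archive, with the remote form shown as a comment; the URL in the comment is a placeholder.

```shell
# Remote form: curl -s http://host/path/file.tar.gz | tar tzf - | grep pattern
# tar reads the gzipped stream from stdin and prints member names as they
# arrive, so the transfer can be stopped as soon as you've seen enough.
mkdir -p /tmp/tar_demo/site
echo "<html>hello</html>" > /tmp/tar_demo/site/index.html
tar czf /tmp/tar_demo.tar.gz -C /tmp/tar_demo site
cat /tmp/tar_demo.tar.gz | tar tzf - | grep index.html   # prints site/index.html
```

Extracting a single member works the same way with `tar xzf - site/index.html`; anything fancier than name matching needs a real index, as noted below.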
22:36 ๐Ÿ”— Ymgve http://news.slashdot.org/story/11/12/07/2034200/library-of-congress-to-receive-entire-twitter-archive
22:36 ๐Ÿ”— Ymgve cool
22:37 ๐Ÿ”— dnova emijrp: nobody is stopping YOU from making mirrors
22:37 ๐Ÿ”— yipdw if you want something more sophisticated than that, you need to build an index
22:37 ๐Ÿ”— dnova not everyone can afford to host these things.
22:39 ๐Ÿ”— yipdw hmm, my ff link grabber is still a bit retarded
22:39 ๐Ÿ”— yipdw I, [2011-12-07T16:39:37.370755 #78706] INFO -- : Found 0 categories, 0 stories from /r/6551377/
22:39 ๐Ÿ”— yipdw should ignore these:
22:39 ๐Ÿ”— yipdw I, [2011-12-07T16:39:39.987928 #78706] INFO -- : Found 0 categories, 0 stories from /u/1148547/Hobbit4Lyfe
22:39 ๐Ÿ”— yipdw oh well
22:39 ๐Ÿ”— DFJustin videos can be uploaded to archive.org right now to the community videos collection, and then once there's a bunch of them it should be easy to poke someone and get them to create a collection
22:40 ๐Ÿ”— DFJustin if you don't have bandwidth then recruit some buddies
22:40 ๐Ÿ”— yipdw heh
22:41 ๐Ÿ”— yipdw 1.9.2-p290 :015 > b = Redis.new.smembers('stories').map(&:to_i).sort; [b.length, b.min, b.max]
22:41 ๐Ÿ”— yipdw => [89974, 158, 7617073]
22:41 ๐Ÿ”— yipdw that's a pretty sparsely inhabited space
22:42 ๐Ÿ”— yipdw granted, that doesn't include any of the crossovers etc
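To put those Redis numbers in perspective: 89,974 discovered IDs against a maximum observed ID of 7,617,073 is around 1% occupancy for the slice crawled so far, which is what "sparsely inhabited" means here. A quick integer check, using only the figures from the log:

```shell
# Density of discovered story IDs over the observed ID range (log figures).
found=89974
max_id=7617073
permille=$(( found * 1000 / max_id ))   # integer per-mille, avoids floats
echo "$permille"   # prints 11, i.e. roughly 1.1% of the id space
```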
22:57 ๐Ÿ”— bsmith093 wait 80% full is sparsely inhabited
22:59 ๐Ÿ”— bsmith093 afaik, every genre page has its own crossover page, and would it kill somebody to back this script up by sorting the good/bad story ids?
23:00 ๐Ÿ”— bsmith093 because the lowest story id is 4, if im reading that right, yours says 158
