#archiveteam 2012-08-28,Tue

Time	Nickname	Message
01:39 ^🔗	swebb	Oops. I just realized that my irc logger has been off for the last 3 days. Doah. http://badcheese.com/~steve/atlogs
01:42 ^🔗	tef_	godane: I should have something in ~15 minutes
01:52 ^🔗	tef_	godane: I think i'm done
01:53 ^🔗	godane	:-D
01:53 ^🔗	godane	code please?
01:55 ^🔗	tef_	http://code.hanzoarchives.com/warc-tools/src/2a7976f9e7d7/warclinks.py
01:55 ^🔗	tef_	should handle all sorts of links in warcs (only html though...)
01:55 ^🔗	tef_	handles relative urls too
01:55 ^🔗	tef_	I happened to have a html link extractor using the py stdlib kicking around
01:55 ^🔗	tef_	and it helps I wrote a warc library :-)
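A minimal sketch of the kind of stdlib-only link extractor tef_ describes, written for Python 3 (the script in the log ran on Python 2). It is not the code from warclinks.py: the `LinkCollector` class, the `extract_links` helper, and the choice of `href`/`src` attributes are assumptions made for illustration. Relative URLs are resolved against the page URL with `urljoin`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly carry URLs; illustrative, not exhaustive.
URL_ATTRS = {"href", "src"}


class LinkCollector(HTMLParser):
    """Collect absolute URLs from href/src attributes (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # urljoin resolves relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every href/src URL in `html`, made absolute."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_links('<a href="/pdf/a.pdf">x</a>', 'http://example.org/articles/index.html')` resolves the link against the site root. The stdlib parser is tolerant but not bulletproof on badly broken HTML, which is the weakness tef_ runs into later in the log.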
01:56 ^🔗	tef_	should be able to do hg clone ... (or grab a tarball)
01:56 ^🔗	tef_	export PYTHONPATH=`pwd`
01:56 ^🔗	tef_	python warclinks.py warc-files....
01:56 ^🔗	tef_	handles gzipped, non gzipped files
01:57 ^🔗	tef_	if you have 6+ month old warc files from when wget-warc produced weird files, I can put a fix in for that, but warc2warc --wget-chunk-fix should sort it
01:57 ^🔗	tef_	it doesn't keep a set of links
01:57 ^🔗	tef_	it could produce a list of urls found in the links, that aren't in the warc
01:57 ^🔗	tef_	but you can do warcdump ... | grep WARC-Target | cut ...
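The "links found that aren't in the warc" idea is a set difference. A rough Python sketch, assuming the dump prints each record header as a `WARC-Target-URI: <url>` line (the exact warcdump output format is not shown in the log, and `target_uris` and `missing_links` are hypothetical names):

```python
def target_uris(dump_lines):
    """Pull URIs out of 'WARC-Target-URI: ...' header lines."""
    prefix = "WARC-Target-URI:"
    uris = set()
    for line in dump_lines:
        line = line.strip()
        if line.startswith(prefix):
            uris.add(line[len(prefix):].strip())
    return uris


def missing_links(found_links, dump_lines):
    """Links that were extracted but have no record in the warc."""
    return sorted(set(found_links) - target_uris(dump_lines))
```

The result is the list of URLs still left to fetch.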
01:58 ^🔗	godane	found a error
01:58 ^🔗	tef_	any questions? I've only tested it a little
01:58 ^🔗	tef_	ah balls
01:58 ^🔗	tef_	can you pastebin ?
01:59 ^🔗	godane	http://pastebin.com/NfbFUy2Q
01:59 ^🔗	tef_	hrm, it shouldn't be raising that
01:59 ^🔗	tef_	oh i'm a muppet.
01:59 ^🔗	tef_	hrm, you've got some lovely html there :-)
02:00 ^🔗	godane	i know
02:00 ^🔗	godane	this is the first time that grep -ohP doesn't work to grab/filter all urls
02:01 ^🔗	godane	i'm trying to use that to grab all images from sites like techcrunch and such
02:01 ^🔗	tef_	I pushed a fix to skip them properly
02:01 ^🔗	tef_	but I should replace it with something more reliable than python's built in parser
02:01 ^🔗	tef_	maybe I should use beautiful soup or lxml
02:02 ^🔗	tef_	but it will get you some of the urls, maybe, I hope, :-)
02:05 ^🔗	tef_	ugh
02:05 ^🔗	tef_	I am an idiot
02:05 ^🔗	tef_	anyway, I'm gonna try and put beautiful soup in
02:05 ^🔗	tef_	should handle everything
02:06 ^🔗	godane	ok
02:06 ^🔗	tef_	rather than committing typos :3
02:29 ^🔗	tef_	godane: pushed
02:29 ^🔗	tef_	should use lxml
02:29 ^🔗	tef_	well almost pushed
02:29 ^🔗	tef_	pushed now
02:30 ^🔗	tef_	godane: ping
02:30 ^🔗	godane	hey
02:30 ^🔗	godane	i got it
02:31 ^🔗	godane	there looks be warnings of parse error
02:31 ^🔗	tef_	hrm
02:32 ^🔗	tef_	you may need to install lxml, via python-lxml (apt) or easy_install lxml
02:32 ^🔗	godane	you didn't fix my problem
02:32 ^🔗	godane	the lines still break
02:33 ^🔗	godane	but this does look better and has more stuff in it now
02:33 ^🔗	tef_	just fixing a bug
02:33 ^🔗	tef_	well, maybe a bug
02:34 ^🔗	tef_	godane: how are you running it, I get a whole slew of urls from the examples I try
02:36 ^🔗	underscor	tef_: are you the tef that recently visited #hackerfurs?
02:36 ^🔗	godane	python warclinks groklaw.net-articles-2006.warc.gz > log
02:36 ^🔗	godane	*warclinks.py
02:37 ^🔗	tef_	underscor: yeah, I got dragged in by mithaldu
02:37 ^🔗	tef_	I heard some furries were trash talking my code :-)
02:37 ^🔗	tef_	I assume you're the same underscor there
02:38 ^🔗	tef_	what's the lines still break thing ?
02:38 ^🔗	tef_	hrm
02:38 ^🔗	godane	like i said
02:39 ^🔗	tef_	i'm slow :3
02:39 ^🔗	godane	this warc.gz is special
02:39 ^🔗	tef_	oh so special
02:39 ^🔗	tef_	i'd ask for a copy but I assume It's huge
02:39 ^🔗	godane	no just ~15mb
02:41 ^🔗	underscor	tef_: haha, yeah
02:41 ^🔗	tef_	small world, innit
02:41 ^🔗	tef_	I backed out cos well, I had a clearout of irssi windows
02:42 ^🔗	underscor	Aye
02:46 ^🔗	godane	tef_: you can download it here: http://archive.org/details/groklaw.net-articles-2006-20120827-mirror
02:47 ^🔗	tef_	godane: fetching now
02:50 ^🔗	tef_	oh wow
02:51 ^🔗	godane	you see what i mean now
02:51 ^🔗	godane	even doing a tr -d '\n' does nothing to it
02:52 ^🔗	tef_	yeah
02:52 ^🔗	tef_	that is rather amazing
02:54 ^🔗	tef_	pushed a fix :3
02:54 ^🔗	tef_	godane: try now
02:54 ^🔗	tef_	I can also try stripping fragments too, but I think sed can fix that
02:55 ^🔗	godane	lots of errors now
02:55 ^🔗	tef_	hrm ? I get a bunch of links out
02:56 ^🔗	tef_	did python warclinks.py ~/Downloads/groklaw.net-articles-2006.warc.gz | sort | uniq
02:56 ^🔗	tef_	and without newlines and such
02:56 ^🔗	tef_	try repulling in case something weird happened
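The fragment stripping and the `sort | uniq` step can also be done in one pass in Python. A small sketch, assuming the links are already absolute URLs (`clean_links` is a hypothetical helper, not part of warclinks.py):

```python
from urllib.parse import urldefrag


def clean_links(links):
    """Drop #fragments, then return the unique URLs in sorted order."""
    return sorted({urldefrag(link).url for link in links if link})
```

Two links that differ only in their `#anchor` collapse into one entry.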
02:57 ^🔗	godane	file "warclinks.py", line 64, in extract_links_from_warcfh
02:58 ^🔗	godane	there error i have is your fix
02:58 ^🔗	tef_	hrm
02:58 ^🔗	tef_	do you have a little bit more of that error ?
02:59 ^🔗	tef_	it parses on mine, what version of python are you using ?
02:59 ^🔗	godane	yield link.translate(None, '\n\r\t')
02:59 ^🔗	godane	i'm using python2
02:59 ^🔗	Coderjoe	2.6 or 2.7?
02:59 ^🔗	tef_	can you paste the entire traceback
02:59 ^🔗	godane	2.7.3
02:59 ^🔗	godane	i can't right now
02:59 ^🔗	tef_	baws
02:59 ^🔗	godane	i'm on firefox proxy
02:59 ^🔗	tef_	can you copy and paste the error message at least?
03:00 ^🔗	tef_	rather than just the line
03:00 ^🔗	tef_	which exception
03:00 ^🔗	tef_	as it works on my machine (tm)
03:02 ^🔗	tef_	http://secretvolcanobase.org/~tef/warc_links.txt.gz example output
03:03 ^🔗	godane	http://pastebin.com/NnaN79q1
03:04 ^🔗	tef_	2.7.3 weeerid
03:04 ^🔗	tef_	http://docs.python.org/library/stdtypes.html#str.translate
03:04 ^🔗	tef_	cos it says two arguments here
03:05 ^🔗	tef_	anyway, the txt.gz file has the links you want, I hope
03:05 ^🔗	tef_	hrm
03:05 ^🔗	tef_	aaaaha
03:05 ^🔗	tef_	for some reason on your machine it is sending in unicode
03:08 ^🔗	tef_	godane: pull or try the output provided
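Background on the mismatch: in Python 2 the two-argument form `translate(None, deletechars)` exists only on byte strings, while `unicode.translate` takes a single mapping argument, so the same line fails as soon as the parser hands back unicode. The fix tef_ pushed is not shown in the log. A sketch of a type-agnostic version in Python 3, where `clean_link` and `STRIP_TABLE` are hypothetical names:

```python
# Map each unwanted character to None so translate() deletes it.
STRIP_TABLE = str.maketrans("", "", "\n\r\t")


def clean_link(link):
    """Remove newlines, carriage returns and tabs from a link."""
    # Accept bytes too, since a parser may hand back either type.
    if isinstance(link, bytes):
        link = link.decode("utf-8", errors="replace")
    return link.translate(STRIP_TABLE)
```

Decoding first means the rest of the pipeline only ever sees one string type.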
03:10 ^🔗	godane	thank you
03:11 ^🔗	tef_	fixed?
03:11 ^🔗	godane	yes
03:11 ^🔗	tef_	\o/
03:11 ^🔗	godane	i think
03:12 ^🔗	tef_	well that took longer than 15 minutes :3
03:12 ^🔗	tef_	what an awful warc file
04:00 ^🔗	godane	looks like that warc had 700+mb of pdfs, mp3, ogg, and images from groklaw.net
04:11 ^🔗	godane	there is a error again
04:12 ^🔗	godane	tef_: ping ^
07:48 ^🔗	alard	underscor: Thanks for the warctozip update. Although the new POST things don't really work: your Nginx config apparently has a very low client_max_body_size. Perhaps you can increase that a bit? (It would be even nicer if it didn't buffer the request at all, but that seems to be impossible with Nginx.)
07:48 ^🔗	alard	Similarly, it might be useful to disable proxy_buffering if it's enabled. That can also be done from the script with an extra HTTP header in the response, if that's easier.
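alard does not name the header. One that Nginx recognises for this purpose is `X-Accel-Buffering: no`, which switches off proxy buffering for a single response. A minimal WSGI sketch, assuming the script behind the proxy is a Python app (the `app` function and its body are hypothetical):

```python
def app(environ, start_response):
    """Tiny WSGI app that opts out of Nginx response buffering."""
    body = b"ok\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
        # Asks Nginx not to buffer this proxied response.
        ("X-Accel-Buffering", "no"),
    ]
    start_response("200 OK", headers)
    return [body]
```

This only affects how the response is streamed back. The upload limit alard mentions, `client_max_body_size`, still has to be raised in the Nginx configuration itself.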
09:22 ^🔗	Schbirid	thanks for the Aktuelles Software Magazine collection!
09:36 ^🔗	Schbirid	does someone have/know a tool to completely download a reddit thread? the increments when you click "more" get tiny, so it is quite annoying to do by hand
09:37 ^🔗	ersi	it's called a scripting language, and it's a very sharp tool
09:37 ^🔗	ersi	^_^
09:38 ^🔗	ersi	Wonder how they do the comment collapsing, should take a look at that sometime
09:39 ^🔗	Schbirid	same would be handy for facebook, those threads are nearly impossible to get with a browser since they cant keep up rendering thousands of comments
09:40 ^🔗	alard	Wget+Lua!
09:40 ^🔗	*	Schbirid runs away
09:41 ^🔗	ersi	Ooh, should take a looksie at wget+lua sometime as well
10:49 ^🔗	tef_	godane: ?
13:16 ^🔗	godane	tef_: hey
13:16 ^🔗	godane	i'm back
13:16 ^🔗	godane	it looks like some keys have problems with unicode
13:16 ^🔗	godane	like 0x94
13:16 ^🔗	godane	and 0x31
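The log does not include the actual error, so the following is a guess at the failure mode: a byte such as 0x94 is not valid on its own in UTF-8, but in Windows-1252 it is a closing curly quote. (0x31 is the plain ASCII digit 1, so it should decode fine either way.) A lenient decoder sketch in Python 3, with `decode_lenient` as a hypothetical helper:

```python
def decode_lenient(raw):
    """Decode bytes as UTF-8, falling back to Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few bytes cp1252 leaves undefined.
        return raw.decode("cp1252", errors="replace")
```

Trying UTF-8 first keeps correctly encoded pages intact, and the fallback only kicks in for legacy pages.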
13:17 ^🔗	tef_	hrm
13:45 ^🔗	SketchCow	I just asked archive.org a question about scanning.
13:45 ^🔗	SketchCow	Can we have a volunteer corps of people in the SF Bay area who come in and operate a bookscanner assigned to our group, who then scan computer historical documents.
13:46 ^🔗	SketchCow	If they say yes, I'll start harassing people about joining up.
13:51 ^🔗	tef_	godane: put in a better fix, maybe
14:57 ^🔗	underscor	http://want.archive.org/
14:57 ^🔗	underscor	alard: that will go through the load balancer instead of running on my dev box, if you want to update the demo app
15:39 ^🔗	SketchCow	underscor: Please add a line under "currently only for books/things with ISBNs"
15:39 ^🔗	SketchCow	Experimental: Do not use as a sign-off for large donations of books. Please contact info@archive.org.
15:39 ^🔗	SketchCow	Remove secret mode line
15:50 ^🔗	godane	i got over 8gb of groklaw.org
15:50 ^🔗	godane	:-D
15:51 ^🔗	godane	i do have split some the warc.gz cause downloads stop sometimes
15:52 ^🔗	godane	it maybe closer to 4gb cause i have the mirror .tar.gz and .warc.gz
15:56 ^🔗	alard	underscor: My want-it demo app is asleep, I don't know if I will wake it up again. (I ran the human.io app on my home computer.)
15:56 ^🔗	alard	Also, the want-it api is also visible on http://warctozip.archive.org/ ?
16:15 ^🔗	tef_	godane: did the most recent fix, well, uh fix
16:15 ^🔗	godane	i don't know
16:15 ^🔗	tef_	heh
16:16 ^🔗	godane	i see the error again with my groklaw.net 2011 dump
16:16 ^🔗	tef_	godane: yeah I'm not sure why your lxml is returning unicode
16:17 ^🔗	godane	i think its mostly cause groklaw is special
16:17 ^🔗	godane	i also get some bad urls like this: http://www.groklaw.net/htt[://www.groklaw.net/pdf3/LodsysvCombay-26.pdf
16:18 ^🔗	godane	luckly all bad urls on the top of the list
16:18 ^🔗	tef_	heh
16:18 ^🔗	tef_	yeah I can't fix their broken links
16:19 ^🔗	godane	the thing is i checked for that file
16:19 ^🔗	tef_	pushing a better check for unicode for what it is worth
16:19 ^🔗	tef_	either way I hope you've got more stuff than you would have had without it
16:19 ^🔗	tef_	despite it being buggy and crap :-)
16:20 ^🔗	godane	it has that same broke line problem from what i can tell
16:22 ^🔗	tef_	baws
16:23 ^🔗	tef_	I'm not going to have a lot of time, if any to keep playing hunt the bug when I'm struggling to recreate some of the weirder errors
16:23 ^🔗	tef_	sorry :/
16:23 ^🔗	godane	thats ok
16:24 ^🔗	godane	it filters out the bad urls better then before
16:24 ^🔗	godane	and i think it does fix most of the bad urls
16:27 ^🔗	tef_	yay :D
16:27 ^🔗	tef_	you might find google refine will be good for cleaning up large data sets like this
16:39 ^🔗	DFJustin	<Zuu_> I have a website that I would like to be archived, how would I do so?
16:39 ^🔗	DFJustin	<Zuu_> it's going down saturday sometime, i'll just leave this here: http://www.therevoltpress.org/
16:39 ^🔗	DFJustin	did anyone do this
16:39 ^🔗	DFJustin	godane was disconnected at the time
16:41 ^🔗	Patt	it looks like the website is still up
16:47 ^🔗	godane	i will try to grab it soon
16:48 ^🔗	godane	my groklaw.net grab is very special so i don't want it to stop downloading
16:48 ^🔗	Patt	godane, let me know when/where you download it when your done please
16:50 ^🔗	godane	good news is it doesn't look like it was updated since last year
16:53 ^🔗	godane	but there boards have been busy
16:53 ^🔗	Patt	yea, it will be until it closes
16:53 ^🔗	Patt	no ETA though
16:53 ^🔗	SketchCow	want.archive.org is apparently going to shift names, so don't get comfy with it. :)
17:12 ^🔗	godane	i have to login with a user name and password
17:13 ^🔗	godane	how do you do that with wget?
17:14 ^🔗	alard	godane: HTTP basic authentication? wget --help | grep user
17:15 ^🔗	Patt	godane, you can login with anonymous / anonymous
17:15 ^🔗	Patt	btw
17:24 ^🔗	godane	i'm get this for cookie:
17:24 ^🔗	godane	therevoltpress.org FALSE / FALSE 1377710618 bblastactivity 0
17:24 ^🔗	godane	therevoltpress.org FALSE / FALSE 1377710618 bblastvisit 1346174618
17:24 ^🔗	godane	its not working
17:24 ^🔗	godane	stupid me
17:24 ^🔗	godane	wrong url
17:25 ^🔗	godane	still doesn't work
17:25 ^🔗	godane	therevoltpress.org FALSE / FALSE 1377710728 bblastactivity 0
17:25 ^🔗	godane	therevoltpress.org FALSE / FALSE 1377710728 bblastvisit 1346174728
17:28 ^🔗	godane	i don't think i can mirror it
17:35 ^🔗	godane	what am i doing wrong here:
17:35 ^🔗	godane	wget "http://therevoltpress.org/boards/" --keep-session-cookies --load-cookies=cookies1.txt --content-disposition --mirror --warc-file=therevoltpress.org-20120828 --warc-cdx
17:43 ^🔗	godane	can anyone help me?
17:43 ^🔗	godane	its driving me nuts
17:44 ^🔗	godane	cause i have no idea on how to add cookies to wget the right way
17:48 ^🔗	balrog_	godane: do you have a cookies.txt?
17:48 ^🔗	balrog_	and is it properly formatted?
17:48 ^🔗	godane	yes
17:48 ^🔗	godane	its just like the other ones
17:48 ^🔗	godane	i'm using export cookies addon for firefox to get the cookie
17:49 ^🔗	godane	i may not know where to point it to through
17:49 ^🔗	godane	cause therevoltpress.org/boards/ is not working with wget
17:49 ^🔗	godane	even therevoltpress.org/boards/login.php doesn't work
17:52 ^🔗	alard	-U "Somethingelse." ?
17:52 ^🔗	alard	They may be blocking wget.
17:53 ^🔗	godane	that didn't work
17:56 ^🔗	godane	there using vBulletin 3.8.0 if that helps
17:58 ^🔗	godane	this maybe better for you guys to do it
17:58 ^🔗	godane	i can't do much here
17:58 ^🔗	godane	and even if i could get all of it i maybe more then 10gb
17:59 ^🔗	godane	and i don't think i can get the uploaded on my internet speed
18:08 ^🔗	balrog_	godane: that's worked for me...
18:08 ^🔗	balrog_	are you faking the UA?
18:08 ^🔗	balrog_	I had to for one project
18:15 ^🔗	godane	yes
18:15 ^🔗	godane	show me your code please?
18:15 ^🔗	godane	and send me your cookie
18:15 ^🔗	godane	i getting false / false with my cookies for some reasone
18:19 ^🔗	godane	balrog_: can you please send me the code?
18:19 ^🔗	godane	i'm dieing here
18:30 ^🔗	godane	wget "http://therevoltpress.org/boards/login.php?do=login" --mirror --warc-file=therevoltpress.org-20120828 --warc-cdx -U "ArchiveTeam" --load-cookies=cookies1.txt
18:30 ^🔗	godane	thats my code
18:30 ^🔗	godane	you show my yours?
18:30 ^🔗	godane	*me
18:31 ^🔗	godane	or at least tell me the url your using
18:34 ^🔗	godane	balrog_: where the hell are you?
18:35 ^🔗	balrog_	busy, stuck at work
18:35 ^🔗	godane	can you please help me?
18:35 ^🔗	godane	i don't know why this site will not download
18:35 ^🔗	godane	and i don't know how the hell to save the cookies through wget anymore
18:36 ^🔗	balrog_	what's in cookies1.txt before you start?
18:36 ^🔗	godane	therevoltpress.org FALSE / FALSE 1377696171 bblastvisit 1346174589
18:36 ^🔗	godane	therevoltpress.org FALSE / FALSE 1377696451 bblastactivity 0
18:36 ^🔗	godane	www.therevoltpress.org FALSE / FALSE 0 __utmc 1
18:36 ^🔗	godane	www.therevoltpress.org FALSE / FALSE 1346159957 __utmb 1.2.10.1346158150
18:36 ^🔗	godane	www.therevoltpress.org FALSE / FALSE 1361926157 __utmz 1.1346158150.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
18:36 ^🔗	godane	www.therevoltpress.org FALSE / FALSE 1409230157 __utma 1.882311859.1346158150.1346158150.1346158150.1
18:36 ^🔗	godane	therevoltpress.org FALSE / FALSE 0 bbsessionhash a11c86836d5471bdda445db209cb2e5a
18:36 ^🔗	godane	thats all my therevoltpress.org cookies
18:37 ^🔗	godane	i have no idea why there not working
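For reference, the "FALSE / FALSE" in each cookie line is not an error. A Netscape-style cookies.txt line has seven fields: domain, include-subdomains flag, path, secure flag, expiry (Unix time), name, and value. A parsing sketch in Python, where `Cookie`, `parse_cookie_line`, and `is_session_cookie` are hypothetical names and whitespace splitting is an assumption made so the space-separated paste above also parses:

```python
from collections import namedtuple

Cookie = namedtuple(
    "Cookie", "domain include_subdomains path secure expires name value"
)


def parse_cookie_line(line):
    """Parse one cookies.txt line; return None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # The file format is tab-separated; splitting on any whitespace also
    # accepts the space-separated paste shown in the log.
    parts = line.split(None, 6)
    if len(parts) != 7:
        return None
    domain, subdomains, path, secure, expires, name, value = parts
    return Cookie(domain, subdomains == "TRUE", path, secure == "TRUE",
                  int(expires), name, value)


def is_session_cookie(cookie):
    # An expiry of 0 marks a cookie with no fixed lifetime.
    return cookie.expires == 0
```

Run over the lines above, this shows that `bbsessionhash` is a session cookie on the bare domain with path `/`, while the `__utm*` entries are analytics cookies on the `www.` host.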
18:39 ^🔗	godane	Patt: any ideas on how to mirror therevoltpress.org
18:39 ^🔗	godane	Patt: remember you asked for me by name
18:40 ^🔗	underscor	Alard: fixed. Thanks.
20:58 ^🔗	SketchCow	alard: When could we make the memac search public?
21:00 ^🔗	alard	Hadn't you already done that?
21:01 ^🔗	alard	I think it won't get more complete than it is now. The .zip download links work. It's a pity the .warc.gz download links don't work, but I think that's an issue with the archive.org tarviewer.
21:11 ^🔗	SketchCow	Well, I'm about to give it to a press person
21:11 ^🔗	SketchCow	So if it can be set up as ready to go for press, let's do it.
21:13 ^🔗	chronomex	the fixed-width font will scare muggles
21:14 ^🔗	chronomex	I'm all for it
21:17 ^🔗	SketchCow	WHY MUST YOU SELL FEAR
21:17 ^🔗	SketchCow	+1 for "muggles"
21:18 ^🔗	SketchCow	Always amazed how that one goes by
21:18 ^🔗	SketchCow	Treats them like cattle
21:18 ^🔗	SketchCow	Also liked how one book basically had magic dude show up in prime minister's office going "major shit going down lol brb"
21:18 ^🔗	chronomex	heh
21:19 ^🔗	alard	Yeah, so, well, the search page is here: http://archive.org/download/archiveteam-mobileme-index/mobileme-20120817.html
21:20 ^🔗	alard	It may or may not need a lot of text and explanations.
21:20 ^🔗	alard	Why am I not here? Why am I here? How did you hack my account?
21:20 ^🔗	chronomex	hm
21:21 ^🔗	chronomex	how the fuzz does this work anyway
21:21 ^🔗	alard	"Email complaints@archiveteam.org to get your things removed."
21:21 ^🔗	chronomex	ahhh
21:21 ^🔗	alard	It's just a 400MB JSON file sorted alphabetically.
21:21 ^🔗	chronomex	that's tricky
21:22 ^🔗	chronomex	not worthwhile to split it up?
21:22 ^🔗	soultcer	So searching requires me to download a 400 MB file?
21:22 ^🔗	alard	No, you just download small bits of it.
21:22 ^🔗	chronomex	ah, cool
21:22 ^🔗	soultcer	Magic
21:22 ^🔗	chronomex	it does some sort of binary windowing thing?
21:23 ^🔗	alard	https://ia600403.us.archive.org/30/items/archiveteam-mobileme-index/
21:23 ^🔗	alard	There's an index to the large json file, with the locations of where items start.
21:23 ^🔗	chronomex	hot diggity damn
21:24 ^🔗	alard	Because it's sorted, you know that the item X should be in bytes n-m.
21:24 ^🔗	alard	(If that's abstract enough.)
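A rough sketch of the lookup alard describes, in Python rather than the browser JavaScript the page actually uses. The index layout (a sorted list of `(first_key, byte_offset)` pairs) and the names `byte_range` and `range_header` are assumptions for illustration, since the real index format is not shown in the log:

```python
import bisect


def byte_range(index, key, total_size):
    """Return (start, end) of the chunk that would hold `key`, or None.

    `index` is a sorted list of (first_key, offset) pairs, one per chunk.
    """
    keys = [first_key for first_key, _ in index]
    # Rightmost chunk whose first key is <= the search key.
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before the first chunk
    start = index[i][1]
    if i + 1 < len(index):
        end = index[i + 1][1] - 1
    else:
        end = total_size - 1
    return start, end


def range_header(start, end):
    # HTTP byte ranges are inclusive on both ends.
    return {"Range": "bytes=%d-%d" % (start, end)}
```

The client then fetches only that slice of the large file and scans it for the exact key, so no database is needed on the server side.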
21:24 ^🔗	chronomex	hangs infinitely in opera
21:25 ^🔗	alard	Does it.
21:25 ^🔗	chronomex	yurp
21:25 ^🔗	alard	Any idea why?
21:25 ^🔗	*	chronomex shrugs
21:25 ^🔗	chronomex	opera's weird
21:25 ^🔗	alard	I tried it in Firefox and Chrome.
21:25 ^🔗	chronomex	yeah, works fine in chrome
21:26 ^🔗	alard	It's a bit tricky, so you need a modern browser. But it doesn't need a database.
21:26 ^🔗	chronomex	it's spiffy
21:26 ^🔗	chronomex	I like it
21:27 ^🔗	chronomex	this is the future
21:28 ^🔗	alard	It's the past. It's just a horribly slow search engine that can only search on one key.
21:28 ^🔗	alard	It's fast enough to be usable, though.
21:28 ^🔗	chronomex	yeah
21:29 ^🔗	chronomex	https://ia600403.us.archive.org/30/items/archiveteam-mobileme-index/mobileme-20120817.html#chronomex hah, I suppose I put my own name through the script at some point
21:31 ^🔗	alard	We're flooding the channel. :)
21:33 ^🔗	ersi	take it to #internetarchive, you!
21:33 ^🔗	ersi	or #nowwhat :D or.. -bs
21:34 ^🔗	ersi	endless possibilities
21:34 ^🔗	alard	We should have a hash function where you can enter a topic and it'll tell you to go to #archiveteam-${hash}
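Taking the joke at face value, a sketch of such a function in Python. The use of SHA-1, the six-character suffix, and the name `channel_for` are all arbitrary choices made for illustration:

```python
import hashlib


def channel_for(topic, length=6):
    """Map a topic to a deterministic #archiveteam-<hash> channel name."""
    normalised = topic.strip().lower().encode("utf-8")
    digest = hashlib.sha1(normalised).hexdigest()
    return "#archiveteam-" + digest[:length]
```

Normalising case and whitespace first means "MobileMe" and " mobileme " land in the same channel.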
21:35 ^🔗	alard	Let's go to #nowwhat
21:35 ^🔗	ersi	or just a stab at random
21:37 ^🔗	alard	We'll just change channels after every second message. That's what real hackers do, I've heard.
21:39 ^🔗	closure	7 layers of channels
21:57 ^🔗	alard	Installed Opera, found the problem: Opera is stupid, it doesn't do Range: headers in XmlHttpRequest, so it starts downloading the full 400MB.
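One way a client could guard against this, sketched in Python: treat anything other than a 206 Partial Content response with a `Content-Range` header as "range not applied" and stop before the full file arrives. The function name `range_was_honoured` is hypothetical, and the search page itself is JavaScript, so this only illustrates the check:

```python
def range_was_honoured(status, headers):
    """True if the response is a partial one carrying a Content-Range."""
    names = {name.lower() for name in headers}
    return status == 206 and "content-range" in names
```

A plain 200 response means the whole 400MB body is on its way, so the caller should abort the request.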
21:58 ^🔗	alard	(It also opens connections to ebay, booking.com and other sites, without my asking so.)
21:59 ^🔗	alard	SketchCow: Anything else you need to make the search thing ready to go to press?
21:59 ^🔗	SketchCow	http://archive.org/download/archiveteam-mobileme-index/mobileme-20120817.html is what we go with, right?
21:59 ^🔗	alard	Yes. It's possible to put it in an iframe somewhere on archiveteam.org, if that's better.
22:02 ^🔗	dashcloud	want.archive.org sounds great- how do you get books to IA? (is there going to be a blog post somewhere on this? or is it not public-ready yet?)
22:02 ^🔗	SketchCow	Not public ready
22:02 ^🔗	SketchCow	But you basically mail them books. I send mine in crates, media mail.
22:02 ^🔗	SketchCow	200 went out today
22:03 ^🔗	SketchCow	archive.org wants to take it under consideration before it becomes an official API
22:03 ^🔗	chronomex	I'd love to unload some books, I have way too many for a single man in a city :(
22:04 ^🔗	chronomex	I'll do an inventory eventually
22:05 ^🔗	soultcer	chronomex: Check out bookmooch.com, it allows you to trade books by mail
22:05 ^🔗	chronomex	meh, that sounds like a lot of work
22:06 ^🔗	chronomex	also I have too many books
22:06 ^🔗	chronomex	I should scan the rare ones.
22:06 ^🔗	dashcloud	I do as well- I've had to switch to ebooks because I don't really have more room for physical copies
22:07 ^🔗	chronomex	the space under my bed is about 80% books.
22:07 ^🔗	dashcloud	every shelf is full of books, and nearly the entire wall is lined with piles of books
22:09 ^🔗	dashcloud	I'd love to do the book scanning thing, but it takes a more disciplined and dedicated person than me to do that- I'd get distracted by reading parts of the pages as I flipped by, and it's a lot more tedious flipping pages and taking pictures than reading the book
22:09 ^🔗	DFJustin	haha I'm not the only one
22:11 ^🔗	chronomex	I've only scanned one in toto, which is probably the most valuable book I own - http://archive.org/details/TheElectronicSwitchingSystem
22:11 ^🔗	dashcloud	that instructable on the cardboard box bookscanner makes the whole thing look easy, but apart from the aforementioned issues, there's the post processing of each page- which is SO much easier if your pictures are uniform in each respect
22:12 ^🔗	SketchCow	This is why we're working on want.archive.org
22:12 ^🔗	SketchCow	Send them to archive.org, they get scanned in
22:12 ^🔗	DFJustin	I used to scan books on a flatbed for distributed proofreaders, you kids and your diy things
22:14 ^🔗	chronomex	DFJustin: gutenberg?
22:14 ^🔗	DFJustin	yeah
22:14 ^🔗	DFJustin	unfortunately the raw scans all ate it in an hdd crash, unless dp still has them
22:15 ^🔗	chronomex	:( ): :(
22:15 ^🔗	DFJustin	the pg guys made some wicked ebook editions though http://www.gutenberg.org/files/16410/16410-h/16410-h.htm
22:15 ^🔗	dashcloud	that's a great idea, except making space is only half the reason I'm scanning a book- the other is to have an ebook version of it (which I'm pretty sure I can't get from archive.org- books are too new)
22:15 ^🔗	chronomex	DFJustin: oh that's sexy
22:16 ^🔗	chronomex	I got 2/3 of the way through TeXifying that book too - http://gir.seattlewireless.net/~chronomex/bellsystem/morris/Morris.html
22:17 ^🔗	dashcloud	if you tell me I can get an electronic copy of every book I mail into IA, I'd crate a large part of books and send them very quickly
22:17 ^🔗	chronomex	yeah.
22:17 ^🔗	DFJustin	it's not legal since they want to lend out the electronic copy
22:18 ^🔗	chronomex	yeah :S
22:19 ^🔗	dashcloud	the other scanning project you proposed to archive.org sounds great as well- the historical computer document one
22:47 ^🔗	SketchCow	I made the formal proposal to archive.org about that
23:26 ^🔗	Coderjoe	I still would like a DIY bookscanner :D
23:45 ^🔗	DFJustin	wasn't SketchCow supposed to get one of those like 6+ months ago and CHANGE COMPUTER HISTORY
23:45 ^🔗	SketchCow	Yes
23:45 ^🔗	SketchCow	I've been needling the guy - little response
23:46 ^🔗	SketchCow	I've got a few "Getting that right to you (six months ago)" so I'm not going to get too het up
