#archiveteam 2012-10-24,Wed


Time	Nickname	Message
01:46 ^🔗	joepie91	btw, SketchCow, I think you may find this useful for keeping track of things: http://www.treesheets.com/
01:47 ^🔗	joepie91	(may also be useful for others, and it runs natively on Linux as well)
02:23 ^🔗	SketchCow	Copied FORTUNECITY/com/meltingpot/com-meltingpot-research-20120405-005316.warc.gz to warc
02:23 ^🔗	SketchCow	alard:
02:23 ^🔗	SketchCow	Checking FORTUNECITY/com/meltingpot/com-meltingpot-gambia-20120401-144041.warc.gz
02:23 ^🔗	SketchCow	Could not decompress warc.gz. gunzip returned 2.
02:23 ^🔗	SketchCow	Copying FORTUNECITY/com/meltingpot/com-meltingpot-gambia-20120401-144041.warc.gz to tar
02:23 ^🔗	SketchCow	So that's good.
02:26 ^🔗	dashcloud	did you see my note about the two Coming Soon items?
02:30 ^🔗	SketchCow	17:41 <@dashcloud> so reading the scrollback, I did a brief check of the items, and I came across Coming Soon, which has one item as WARCS, and there's a second item with a WARC file inside a zipfile
02:30 ^🔗	SketchCow	That, right?
02:30 ^🔗	dashcloud	yes
02:30 ^🔗	SketchCow	The thing 6 lines up?
02:30 ^🔗	dashcloud	sorry!
02:30 ^🔗	SketchCow	Or are you watching joins and parts?
02:30 ^🔗	SketchCow	Because I turned THAT shit off MONTHS ago.
02:30 ^🔗	SketchCow	I'd have gone insane.
02:32 ^🔗	SketchCow	http://archive.org/details/csoon-20111016 this one?
02:32 ^🔗	SketchCow	I see.
02:32 ^🔗	SketchCow	Yes, it's handled. The csoon-* is a WARC of the same
02:32 ^🔗	SketchCow	Good eye, though.
02:33 ^🔗	dashcloud	okay
02:37 ^🔗	joepie91	ok, seriously, I love scantailor
02:37 ^🔗	SketchCow	scantailor fixes everything.
02:37 ^🔗	joepie91	yes, pretty much
02:37 ^🔗	joepie91	comics, books, it does all of it :o
02:37 ^🔗	joepie91	and most of it automated
02:37 ^🔗	joepie91	hell, it pretty much successfully cleaned up a book that was copied on a typewriter
02:38 ^🔗	joepie91	on shitty spotty recycled paper
02:38 ^🔗	SketchCow	As my friend Dan Reetz likes to say, sometimes scantailor unwittingly fixes typesetting errors with books
02:38 ^🔗	joepie91	heh
02:38 ^🔗	SketchCow	Where the plates were off by a millimeter or so
02:38 ^🔗	joepie91	SketchCow: http://aarnist.cryto.net:81/vrijheid2.pdf
02:38 ^🔗	joepie91	is the result
02:38 ^🔗	chronomex	oh yeah I've had books come out less crooked than the original
02:38 ^🔗	joepie91	two pages are missing and I should rescan some pages because they were too fuzzy
02:38 ^🔗	joepie91	but overall it's VERY nice
02:39 ^🔗	joepie91	also, tiff2pdf somehow fucked up the front cover, not sure why :P
02:39 ^🔗	chronomex	that is a nice scan.
02:40 ^🔗	joepie91	yes, yes it is :)
02:40 ^🔗	joepie91	but yeah, a few pages definitely needs fixing
02:40 ^🔗	joepie91	need *
02:42 ^🔗	joepie91	109, for example, is a bit meh
02:43 ^🔗	balrog-	joepie91: tiff2pdf is picky about input tiff
02:43 ^🔗	balrog-	very, very picky
02:47 ^🔗	joepie91	yes, so I've noticed
02:47 ^🔗	joepie91	I suspect there's some color space fuckup or something
02:48 ^🔗	joepie91	what I have noticed that has somewhat surprised me: it's possible to make scans of professional quality on Linux with free software alone
02:48 ^🔗	joepie91	from scan to postprocessed PDF
02:48 ^🔗	joepie91	and reasonably automate-able
02:48 ^🔗	balrog-	joepie91: if you or someone is willing to fix hocr2pdf or write a working alternative, then you can have OCRed too
02:49 ^🔗	balrog-	tesseract produces decent output
02:49 ^🔗	joepie91	what language is it written in?
02:49 ^🔗	balrog-	C
02:49 ^🔗	joepie91	ah, not my thing
02:49 ^🔗	joepie91	though
02:49 ^🔗	joepie91	I may know someone who can do that
02:49 ^🔗	balrog-	but there's hOCR-handling stuff in ruby and iirc in python
02:49 ^🔗	chronomex	ocropus too
02:49 ^🔗	joepie91	will give him a poke :P
02:49 ^🔗	joepie91	right
02:49 ^🔗	balrog-	does ocropus handle hOCR?
02:49 ^🔗	chronomex	idk
02:49 ^🔗	balrog-	the OCR step is mostly good
02:49 ^🔗	balrog-	the tricky part is putting the hOCR into the PDF
02:49 ^🔗	joepie91	speaking of which, a potential nice archiveteam-project: build a fast book scanner with fully automated software 'chain' from scan/photo to OCRed ebook files
02:49 ^🔗	joepie91	make it publicly accessible
02:49 ^🔗	joepie91	"come turn your book into an ebook here for free"
02:49 ^🔗	balrog-	hOCR is the OCRed text in HTML format with tags indicating the location
02:50 ^🔗	joepie91	and at the same time, archive/catalogue the scanned books
02:50 ^🔗	chronomex	http://en.wikipedia.org/wiki/HOCR says yes, ocropus and tesseract both
02:50 ^🔗	balrog-	that's software that OUTPUTS it
02:50 ^🔗	joepie91	basically, IRL archiveteam project
02:50 ^🔗	balrog-	you need something to input it and stuff it into a PDF
02:50 ^🔗	chronomex	yep
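
[Editor's sketch] The input side balrog- describes, reading hOCR and recovering each word with its location, can be illustrated with Python's standard library alone. This is a minimal sketch, not what hocr2pdf does: it assumes words are marked as `span` elements with class `ocrx_word` and carry a `bbox x0 y0 x1 y1` property in their `title` attribute, as in the hOCR format. `HocrWordParser` and `parse_bbox` are hypothetical names, and placing the text into a PDF is left out entirely.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # Properties are separated by ';', e.g. "bbox 10 2 40 18; x_wconf 95".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []     # list of (text, (x0, y0, x1, y1))
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []     # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None
```

Feeding the parser an hOCR page with `parser.feed(html_text)` leaves the words and their pixel boxes in `parser.words`; a PDF writer would still have to scale those boxes to page coordinates and draw the text invisibly over the scan.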
02:50 ^🔗	joepie91	balrog-: I'll have a look at it some time soon
02:50 ^🔗	balrog-	http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/
02:51 ^🔗	balrog-	svn.exactcode.de for the code
02:51 ^🔗	joepie91	ok :)
02:51 ^🔗	joepie91	but yeah, balrog-, chronomex, thoughts on IRL bookscanning project?
02:51 ^🔗	balrog-	well, I'd first need a bookscanner
02:51 ^🔗	balrog-	problem is, you don't want to know how many books I have.
02:52 ^🔗	chronomex	you have a lot
02:52 ^🔗	chronomex	got it
02:52 ^🔗	joepie91	well
02:52 ^🔗	joepie91	idk if I pasted this, but I ran across a video of a bookscanner
02:52 ^🔗	joepie91	that would do the job
02:52 ^🔗	joepie91	and I think it should be fairly inexpensive to build
02:52 ^🔗	chronomex	the automatic one with the wedge?
02:52 ^🔗	chronomex	yeah that's cool
02:52 ^🔗	joepie91	yeah
02:52 ^🔗	chronomex	dunno about getting the sensors right down at the tip tho
02:52 ^🔗	joepie91	all you need is basically a strong servo, a compressor, and two scanner units
02:52 ^🔗	joepie91	(I think)
02:52 ^🔗	chronomex	s/servo/stepper/
02:52 ^🔗	joepie91	I suck at terms
02:52 ^🔗	joepie91	stepper, right
02:53 ^🔗	joepie91	terminology*
02:53 ^🔗	joepie91	... wow, that was a self-proving statement lol
02:53 ^🔗	chronomex	you need + and - air
02:53 ^🔗	joepie91	right, I know some people here that can probably do that
02:53 ^🔗	joepie91	and they probably have the parts for it, too
02:54 ^🔗	joepie91	but yeah, it would be sort of epic to just have a book scanner somewhere in a public space, where anyone can scan a book and get the resulting ebook emailed to him
02:54 ^🔗	joepie91	and at the same time have the source files and postprocessed files archived centrally
02:54 ^🔗	joepie91	and judging from the software that is available, that should be fairly easy to automate
02:55 ^🔗	joepie91	but then a camera setup would probably be best
02:55 ^🔗	joepie91	for starters
02:55 ^🔗	joepie91	since the wedge thing is a bit.. large :P
02:56 ^🔗	chronomex	yea
02:56 ^🔗	joepie91	and while the camera bookscanner can run off some kind of battery, that will be tricky for the wedge model
02:56 ^🔗	joepie91	I mean, you could just put the camera bookscanner somewhere outside a mall temporarily
02:57 ^🔗	joepie91	and run it off a battery and local storage
03:01 ^🔗	chronomex	have it spit out usb sticks or something
03:01 ^🔗	chronomex	"insert usb stick or sd card to receive a pdf!"
03:01 ^🔗	joepie91	possible as well
03:01 ^🔗	joepie91	maybe offer both USB and SD for instant ebook
03:02 ^🔗	joepie91	or "give your email and we'll send it at the end of the day" as alternative
03:02 ^🔗	joepie91	since USB sticks and SD cards tend to get lost :P
03:03 ^🔗	joepie91	combine a custom python script using python-imaging-sane or whatever is needed to take webcam pictures (depending on setup)
03:03 ^🔗	joepie91	with postprocessing via scantailor-cli
03:04 ^🔗	joepie91	then tiffcp and tiff2pdf
03:04 ^🔗	joepie91	and optionally calibre to produce a .mobi and .epub
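
[Editor's sketch] The chain joepie91 lists (scantailor-cli, then tiffcp and tiff2pdf, then optionally calibre) could be laid out roughly as below. This is a sketch under assumptions, not his script: `build_commands`, the directory layout, and the file names are hypothetical; it assumes scantailor-cli takes input images followed by an output directory and writes one `.tif` per page under the same base name; and calibre is represented by its `ebook-convert` command-line tool. The function only builds the argument lists and runs nothing.

```python
from pathlib import Path


def build_commands(pages, out_dir, title):
    """Return one argv list per pipeline stage; nothing is executed here."""
    out_dir = Path(out_dir)
    cleaned_dir = out_dir / "cleaned"
    # Assumption: scantailor-cli keeps each page's base name and writes .tif.
    cleaned = [str(cleaned_dir / (Path(p).stem + ".tif")) for p in pages]
    merged = str(out_dir / f"{title}.tif")
    pdf = str(out_dir / f"{title}.pdf")
    return [
        # 1. Post-process the raw scans (deskew, crop, clean up).
        ["scantailor-cli", *[str(p) for p in pages], str(cleaned_dir)],
        # 2. Combine the cleaned pages into one multi-page TIFF.
        ["tiffcp", *cleaned, merged],
        # 3. Convert the multi-page TIFF to a PDF.
        ["tiff2pdf", "-o", pdf, merged],
        # 4. Optional ebook formats via calibre's converter.
        ["ebook-convert", pdf, str(out_dir / f"{title}.epub")],
        ["ebook-convert", pdf, str(out_dir / f"{title}.mobi")],
    ]
```

Each stage could then be run in order with `subprocess.run(cmd, check=True)`, so that a failing step stops the chain instead of feeding bad output to the next tool.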
03:10 ^🔗	chronomex	would we trust the user to metadata
03:11 ^🔗	chronomex	I don't trust anyone to metadata unless they're 1) a librarian, 2) super picky, or 3) me
03:11 ^🔗	chronomex	I suppose 2 is redundant
03:11 ^🔗	joepie91	I'd say, let the user give metadata first
03:11 ^🔗	joepie91	then review before final archival
03:11 ^🔗	chronomex	aye
03:11 ^🔗	joepie91	at the end of the day
03:11 ^🔗	joepie91	you have to review anyway
03:11 ^🔗	joepie91	to get rid of any personal markings
03:11 ^🔗	joepie91	owner names, stamps, etc
03:11 ^🔗	chronomex	yeah proofing metadata against a title page is pretty straightforward
03:11 ^🔗	chronomex	no
03:11 ^🔗	chronomex	leave that in
03:12 ^🔗	joepie91	that'll cause an issue for people
03:12 ^🔗	chronomex	hm?
03:12 ^🔗	joepie91	I doubt they'd want their name associated with a scan
03:12 ^🔗	chronomex	oh
03:12 ^🔗	chronomex	tell them not to scan the bookplate then?
03:12 ^🔗	joepie91	that's no use when scanning is automated :P
03:12 ^🔗	chronomex	oh
03:12 ^🔗	joepie91	most people write their name in the inside
03:12 ^🔗	chronomex	ummmmm
03:12 ^🔗	*	chronomex shrugs
03:12 ^🔗	chronomex	I hadn't considered that
03:12 ^🔗	joepie91	you can just blank that out, it's typically not written over any actual book content
03:13 ^🔗	chronomex	true
03:13 ^🔗	joepie91	same for stamps, they're usually on the inside cover
03:13 ^🔗	joepie91	in the blank area
03:13 ^🔗	chronomex	you could offer the scanning person an option to do that themselves
03:13 ^🔗	joepie91	true
03:13 ^🔗	joepie91	but you have to be careful to not introduce too many variables and options
03:14 ^🔗	joepie91	or the whole appeal of an "ebookify your book here" machine will be gone
03:14 ^🔗	joepie91	it's a tricky thing to average :P
03:15 ^🔗	chronomex	yes
03:18 ^🔗	joepie91	good point: if it requires manual pageturning, people won't do it
03:34 ^🔗	SketchCow	Tried to get one of you guys a keynote for a conference.
03:34 ^🔗	SketchCow	underscor or Chronomex, probably
03:34 ^🔗	SketchCow	They wouldn't go for it
03:35 ^🔗	SketchCow	Mostly because of the way the place works (they vote on the person, not the organization)
03:35 ^🔗	SketchCow	But I tried!
03:35 ^🔗	SketchCow	underscor keynoting would be awwwweeessoommmmee
03:35 ^🔗	SketchCow	They'd not forget THAT
05:54 ^🔗	chronomex	hehe
05:54 ^🔗	chronomex	what organization was this?
06:44 ^🔗	ersi	ArchiveTeam for president!
07:26 ^🔗	joepie91	balrog-, chronomex, good news!
07:26 ^🔗	chronomex	oh yeah?
07:26 ^🔗	joepie91	I wrote a script to fix the tiff2pdf issue
07:26 ^🔗	joepie91	with the discolored PDFs
07:26 ^🔗	joepie91	http://pastie.org/5107570
07:26 ^🔗	chronomex	rad
07:26 ^🔗	joepie91	does a chunked read of a PDF
07:26 ^🔗	joepie91	so it doesn't load all of it in memory at once
07:27 ^🔗	joepie91	and replaces a certain string to fix the issue
07:27 ^🔗	joepie91	and yes, it handles strings on the border between 2 chunks properly :P
07:27 ^🔗	joepie91	if it detects part of the to-be-matched string existing at the end of a chunk
07:27 ^🔗	joepie91	it'll read more to get the rest
07:28 ^🔗	joepie91	so basically, it always loads at most 512kb of data
07:28 ^🔗	joepie91	which means it should be possible to easily process a 1GB PDF if needed
07:28 ^🔗	joepie91	without running out of RAM
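
[Editor's sketch] A chunked search-and-replace of the kind described here can be sketched as follows. It is not the pastie script: instead of reading ahead when a partial match shows up at the end of a chunk, this version always holds back the last `len(old) - 1` unprocessed bytes and prepends them to the next chunk, which covers the same boundary case. `replace_in_stream` is a hypothetical name, and the actual string the fix replaces is not given in the log, so none is shown.

```python
def replace_in_stream(src, dst, old, new, chunk_size=512 * 1024):
    """Copy src to dst, replacing every occurrence of bytes `old` with `new`.

    Only about one chunk is held in memory at a time. Returns the number
    of replacements made.
    """
    if not old:
        raise ValueError("search string must not be empty")
    keep = len(old) - 1   # longest prefix of `old` that can end a buffer
    buf = b""
    count = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pos = 0
        while True:
            hit = buf.find(old, pos)
            if hit < 0:
                break
            dst.write(buf[pos:hit])
            dst.write(new)
            pos = hit + len(old)
            count += 1
        # Hold back the tail that might be the start of a match continuing
        # in the next chunk, but never bytes that were already replaced.
        safe_end = max(pos, len(buf) - keep)
        dst.write(buf[pos:safe_end])
        buf = buf[safe_end:]
    dst.write(buf)   # whatever is left cannot contain a full match
    return count
```

With files opened in binary mode, `replace_in_stream(open("in.pdf", "rb"), open("out.pdf", "wb"), old, new)` processes a file of any size with roughly one chunk in memory. For a PDF, a replacement of the same length as the original string keeps the byte offsets in the file unchanged.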
07:28 ^🔗	chronomex	oboy
07:28 ^🔗	joepie91	also, I tested it ofc, and it works
07:29 ^🔗	joepie91	thanks to these guys: http://www.asmail.be/msg0055295176.html for the fix :P
07:29 ^🔗	joepie91	I'll be releasing a few scripts for scanning soon anyway
07:29 ^🔗	chronomex	nice
07:30 ^🔗	chronomex	I let archive.org's deriver make my pdfs though ;)
07:30 ^🔗	joepie91	heh
07:30 ^🔗	joepie91	anyway, it also has a simple automation script for scanning
07:30 ^🔗	joepie91	interactive CLI script
07:30 ^🔗	joepie91	you pick the device from a list, enter DPI, width, height
07:30 ^🔗	joepie91	hit enter, and it scans a page
07:31 ^🔗	joepie91	hit enter again, and it scans a page
07:31 ^🔗	joepie91	saving them as incrementing numbers
07:31 ^🔗	joepie91	and a separate script for re-scanning certain pages
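
[Editor's sketch] The enter-to-scan loop with incrementing file names might look like the sketch below. It is an illustration, not joepie91's script: `scan_session` is a hypothetical name, and the scanner itself is passed in as a `scan_page(path)` callable (also hypothetical) so that no particular SANE binding or device setup is assumed. Device, DPI, and page size selection are left to that callable.

```python
from pathlib import Path


def scan_session(scan_page, out_dir, prompt=input, start=1):
    """Scan one page per Enter press, saving 0001.tif, 0002.tif, ...

    scan_page(path) must acquire one page and write it to `path`.
    Typing 'q' at the prompt ends the session. Returns the saved paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    page = start
    while True:
        answer = prompt(f"Enter to scan page {page}, q to quit: ")
        if answer.strip().lower() == "q":
            break
        path = out_dir / f"{page:04d}.tif"
        scan_page(path)
        saved.append(path)
        page += 1
    return saved
```

Re-scanning a page then amounts to calling `scan_page` again with that page's existing path, for example `scan_page(Path(out_dir) / "0109.tif")`, which overwrites the old file.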
07:32 ^🔗	joepie91	so, seems I just finished my first comic book scan: http://aarnist.cryto.net:81/straal2.pdf :D
08:07 ^🔗	SketchCow	http://sphotos-a.xx.fbcdn.net/hphotos-ash3/46201_4497571931862_1789693667_n.jpg
08:09 ^🔗	ersi	SketchCow: gay
08:09 ^🔗	joepie91	hahaha
08:10 ^🔗	joepie91	also, I may have an idea for an ultra-cheap camera-based book scanner... but I'll have to see if the camera I have in mind is suitable.
08:10 ^🔗	joepie91	so... searching through boxes it is
08:24 ^🔗	joepie91	interesting... I actually get pictures of reasonable quality with this camera
08:26 ^🔗	joepie91	after postprocessing: http://i.imgur.com/qtX9w.png
08:29 ^🔗	joepie91	I wonder what kind of pictures I could get from this camera with a bit of optimization
08:52 ^🔗	SmileyG	that hurts my eyes to look at ¬_¬
08:52 ^🔗	chronomex	joepie91: I bet alignment would help too
09:04 ^🔗	joepie91	chronomex: problem is this is only 640 * 480
09:04 ^🔗	joepie91	and the focus isn't great
09:04 ^🔗	chronomex	oh
09:04 ^🔗	joepie91	because it obviously doesn't have autofocus
09:04 ^🔗	joepie91	this thing should have a photo mode that does 1280x1024 photos
09:04 ^🔗	joepie91	but it's behaving quite strangely
09:04 ^🔗	joepie91	it goes into photo mode, but when I press the button it'll still just make a video
09:04 ^🔗	joepie91	instead of taking a photo
09:04 ^🔗	joepie91	:|
09:04 ^🔗	joepie91	frustrating
09:05 ^🔗	joepie91	it's this camera: http://www.chucklohr.com/808/C3/index.html
09:05 ^🔗	joepie91	it's an awesome little camera otherwise but it's focused at far objects
09:06 ^🔗	joepie91	so doesn't cope with book text too well :P
10:11 ^🔗	SmileyG	"hope it can help your life safe and happiness" - wut? :D
11:25 ^🔗	joepie91	SmileyG: that's a play on the messages from Chinese eBay sellers
11:25 ^🔗	joepie91	lol
11:35 ^🔗	SmileyG	:D
15:02 ^🔗	SketchCow	So, here we are deep into the WARC transfer of material, either my backhack conversions of previous projects, or the webshots upload.
15:03 ^🔗	SketchCow	I'm now waiting to see if anyone yells about the loading of the data, or the system or anything.
15:03 ^🔗	SketchCow	But looks like we have quite a lot to give them, and who knows.
15:04 ^🔗	godane	i uploaded 2 more linux format dvds this morning
15:04 ^🔗	godane	http://archive.org/details/cdrom-linuxformatmagazine-128
15:05 ^🔗	godane	http://archive.org/details/cdrom-linuxformatmagazine-136
15:05 ^🔗	SketchCow	No need to tell me, godane. I'll get to you on my next sweep of you.
15:05 ^🔗	godane	ok
15:05 ^🔗	godane	i just feel better that my wifi is working again
15:16 ^🔗	underscor	hahaha
15:16 ^🔗	underscor	SketchCow: that would be awesome
15:16 ^🔗	underscor	although
15:16 ^🔗	underscor	I have not a lot of experience speaking
15:28 ^🔗	SketchCow	I'd have coached you.
15:36 ^🔗	underscor	<3
17:30 ^🔗	balrog-	:/ http://www.idigitaltimes.com/articles/12066/20121022/nbc-erases-snl-sketch-digital-archive-copyright.htm
17:43 ^🔗	SketchCow	https://twitter.com/shaneb/status/261159783921483776
17:44 ^🔗	SketchCow	balrog-: Non discussion
18:24 ^🔗	joepie91	balrog-: http://i.imgur.com/GVajj.png
18:24 ^🔗	balrog-	joepie91: yeah I noticed
