[01:46] btw, SketchCow, I think you may find this useful for keeping track of things: http://www.treesheets.com/
[01:47] (may also be useful for others, and it runs natively on Linux as well)
[02:23] Copied FORTUNECITY/com/meltingpot/com-meltingpot-research-20120405-005316.warc.gz to warc
[02:23] alard:
[02:23] Checking FORTUNECITY/com/meltingpot/com-meltingpot-gambia-20120401-144041.warc.gz
[02:23] Could not decompress warc.gz. gunzip returned 2.
[02:23] Copying FORTUNECITY/com/meltingpot/com-meltingpot-gambia-20120401-144041.warc.gz to tar
[02:23] So that's good.
[02:26] did you see my note about the two Coming Soon items?
[02:30] 17:41 <@dashcloud> so reading the scrollback, I did a brief check of the items, and I came across Coming Soon, which has one item as WARCS, and there's a second item with a WARC file inside a zipfile
[02:30] That, right?
[02:30] yes
[02:30] The thing 6 lines up?
[02:30] sorry!
[02:30] Or are you watching joins and parts?
[02:30] Because I turned THAT shit off MONTHS ago.
[02:30] I'd have gone insane.
[02:32] http://archive.org/details/csoon-20111016 this one?
[02:32] I see.
[02:32] Yes, it's handled. The csoon-* is a WARC of the same
[02:32] Good eye, though.
[02:33] okay
[02:37] ok, seriously, I love scantailor
[02:37] scantailor fixes everything.
[02:37] yes, pretty much
[02:37] comics, books, it does all of it :o
[02:37] and most of it automated
[02:37] hell, it pretty much successfully cleaned up a book that was copied *on a typewriter*
[02:38] on shitty spotty recycled paper
[02:38] As my friend Dan Reetz likes to say, sometimes scantailor unwittingly fixes typesetting errors with books
[02:38] heh
[02:38] Where the plates were off by a millimeter or so
[02:38] SketchCow: http://aarnist.cryto.net:81/vrijheid2.pdf
[02:38] is the result
[02:38] oh yeah I've had books come out less crooked than the original
[02:38] two pages are missing and I should rescan some pages because they were too fuzzy
[02:38] but overall it's VERY nice
[02:39] also, tiff2pdf somehow fucked up the front cover, not sure why :P
[02:39] that is a nice scan.
[02:40] yes, yes it is :)
[02:40] but yeah, a few pages definitely needs fixing
[02:40] need *
[02:42] 109, for example, is a bit meh
[02:43] joepie91: tiff2pdf is picky about input tiff
[02:43] very, very picky
[02:47] yes, so I've noticed
[02:47] I suspect there's some color space fuckup or something
[02:48] what I have noticed that has somewhat surprised me: it's possible to make scans of professional quality on Linux with free software alone
[02:48] from scan to postprocessed PDF
[02:48] and reasonably automate-able
[02:48] joepie91: if you or someone is willing to fix hocr2pdf or write a working alternative, then you can have OCRed too
[02:49] tesseract produces decent output
[02:49] what language is it written in?
[02:49] C
[02:49] ah, not my thing
[02:49] though
[02:49] I may know someone who can do that
[02:49] but there's hOCR-handling stuff in ruby and iirc in python
[02:49] ocropus too
[02:49] will give him a poke :P
[02:49] right
[02:49] does ocropus handle hOCR?
[02:49] idk
[02:49] the OCR step is mostly good
[02:49] the tricky part is putting the hOCR into the PDF
[02:49] speaking of which, a potential nice archiveteam-project: build a fast book scanner with fully automated software 'chain' from scan/photo to OCRed ebook files
[02:49] make it publicly accessible
[02:49] "come turn your book into an ebook here for free"
[02:49] hOCR is the OCRed text in HTML format with tags indicating the location
[02:50] and at the same time, archive/catalogue the scanned books
[02:50] http://en.wikipedia.org/wiki/HOCR says yes, ocropus and tesseract both
[02:50] that's software that OUTPUTS it
[02:50] basically, IRL archiveteam project
[02:50] you need something to input it and stuff it into a PDF
[02:50] yep
[02:50] balrog-: I'll have a look at it some time soon
[02:50] http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/
[02:51] svn.exactcode.de for the code
[02:51] ok :)
[02:51] but yeah, balrog-, chronomex, thoughts on IRL bookscanning project?
[02:51] well, I'd first need a bookscanner
[02:51] problem is, you don't want to know how many books I have.
[02:52] you have a lot
[02:52] got it
[02:52] well
[02:52] idk if I pasted this, but I ran across a video of a bookscanner
[02:52] that would do the job
[02:52] and I think it should be fairly inexpensive to build
[02:52] the automatic one with the wedge?
[02:52] yeah that's cool
[02:52] yeah
[02:52] dunno about getting the sensors right down at the tip tho
[02:52] all you need is basically a strong servo, a compressor, and two scanner units
[02:52] (I think)
[02:52] s/servo/stepper/
[02:52] I suck at terms
[02:52] stepper, right
[02:53] terminology*
[02:53] ... wow, that was a self-proving statement lol
[02:53] you need + and - air
[02:53] right, I know some people here that can probably do that
[02:53] and they probably have the parts for it, too
[02:54] but yeah, it would be sort of epic to just have a book scanner somewhere in a public space, where anyone can scan a book and get the resulting ebook emailed to him
[02:54] and at the same time have the source files and postprocessed files archived centrally
[02:54] and judging from the software that is available, that should be fairly easy to automate
[02:55] but then a camera setup would probably be best
[02:55] for starters
[02:55] since the wedge thing is a bit.. large :P
[02:56] yea
[02:56] and while the camera bookscanner can run off some kind of battery, that will be tricky for the wedge model
[02:56] I mean, you could just put the camera bookscanner somewhere outside a mall temporarily
[02:57] and run it off a battery and local storage
[03:01] have it spit out usb sticks or something
[03:01] "insert usb stick or sd card to receive a pdf!"
[03:01] possible as well
[03:01] maybe offer both USB and SD for instant ebook
[03:02] or "give your email and we'll send it at the end of the day" as alternative
[03:02] since USB sticks and SD cards tend to get lost :P
[03:03] combine a custom python script using python-imaging-sane or whatever is needed to take webcam pictures (depending on setup)
[03:03] with postprocessing via scantailor-cli
[03:04] then tiffcp and tiff2pdf
[03:04] and optionally calibre to produce a .mobi and .epub
[03:10] would we trust the user to metadata
[03:11] I don't trust anyone to metadata unless they're 1) a librarian, 2) super picky, or 3) me
[03:11] I suppose 2 is redundant
[03:11] I'd say, let the user give metadata first
[03:11] then review before final archival
[03:11] aye
[03:11] at the end of the day
[03:11] you have to review anyway
[03:11] to get rid of any personal markings
[03:11] owner names, stamps, etc
[03:11] yeah proofing metadata against a title page is pretty straightforward
[03:11] no
[03:11] leave that in
[03:12] that'll cause an issue for people
[03:12] hm?
[03:12] I doubt they'd want their name associated with a scan
[03:12] oh
[03:12] tell them not to scan the bookplate then?
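The chain proposed at [03:03]-[03:04] (scantailor-cli, then tiffcp and tiff2pdf, then optionally calibre) could be sketched roughly as below. This is only an illustration under assumptions: the function name `build_commands`, the directory layout, and the guess that scantailor-cli writes one `.tif` per input page into its output directory are all hypothetical, not taken from the log. The sketch only builds the command lines and does not run anything.

```python
from pathlib import Path


def build_commands(page_images, work_dir, title):
    """Return the command lines for the scan-to-ebook chain, in order.

    Hypothetical helper: nothing is executed here. Each entry could be
    passed to subprocess.run(cmd, check=True) once the tools are installed.
    """
    work = Path(work_dir)
    cleaned_dir = work / "cleaned"
    combined_tiff = work / f"{title}.tif"
    pdf = work / f"{title}.pdf"

    pages = [str(p) for p in page_images]
    # Assumption: scantailor-cli writes <stem>.tif for each input page.
    cleaned = [str(cleaned_dir / (Path(p).stem + ".tif")) for p in page_images]

    return [
        # 1. Postprocess the raw scans (deskew, crop, clean up).
        ["scantailor-cli", *pages, str(cleaned_dir)],
        # 2. Merge the cleaned pages into one multi-page TIFF.
        ["tiffcp", *cleaned, str(combined_tiff)],
        # 3. Convert the multi-page TIFF into a PDF.
        ["tiff2pdf", "-o", str(pdf), str(combined_tiff)],
        # 4. Optional: calibre's ebook-convert for .epub and .mobi.
        ["ebook-convert", str(pdf), str(work / f"{title}.epub")],
        ["ebook-convert", str(pdf), str(work / f"{title}.mobi")],
    ]


if __name__ == "__main__":
    for cmd in build_commands(["0001.tif", "0002.tif"], "out", "book"):
        print(" ".join(cmd))
```

Keeping the steps as plain argument lists makes it easy to log or dry-run the chain before pointing it at a real scan.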
[03:12] that's no use when scanning is automated :P
[03:12] oh
[03:12] most people write their name in the inside
[03:12] ummmmm
[03:12] * chronomex shrugs
[03:12] I hadn't considered that
[03:12] you can just blank that out, it's typically not written over any actual book content
[03:13] true
[03:13] same for stamps, they're usually on the inside cover
[03:13] in the blank area
[03:13] you could offer the scanning person an option to do that themselves
[03:13] true
[03:13] but you have to be careful to not introduce too many variables and options
[03:14] or the whole appeal of an "ebookify your book here" machine will be gone
[03:14] it's a tricky thing to average :P
[03:15] yes
[03:18] good point: if it requires manual pageturning, people won't do it
[03:34] Tried to get one of you guys a keynote for a conference.
[03:34] underscor or Chronomex, probably
[03:34] They wouldn't go for it
[03:35] Mostly because of the way the place works (they vote on the person, not the organization)
[03:35] But I tried!
[03:35] underscor keynoting would be awwwweeessoommmmee
[03:35] They'd not forget THAT
[05:54] hehe
[05:54] what organization was this?
[06:44] ArchiveTeam for president!
[07:26] balrog-, chronomex, good news!
[07:26] oh yeah?
[07:26] I wrote a script to fix the tiff2pdf issue
[07:26] with the discolored PDFs
[07:26] http://pastie.org/5107570
[07:26] rad
[07:26] does a chunked read of a PDF
[07:26] so it doesn't load all of it in memory at once
[07:27] and replaces a certain string to fix the issue
[07:27] and yes, it handles strings on the border between 2 chunks properly :P
[07:27] if it detects part of the to-be-matched string existing at the end of a chunk
[07:27] it'll read more to get the rest
[07:28] so basically, it always loads at most 512kb of data
[07:28] which means it should be possible to easily process a 1GB PDF if needed
[07:28] without running out of RAM
[07:28] oboy
[07:28] also, I tested it ofc, and it works
[07:29] thanks to these guys: http://www.asmail.be/msg0055295176.html for the fix :P
[07:29] I'll be releasing a few scripts for scanning soon anyway
[07:29] nice
[07:30] I let archive.org's deriver make my pdfs though ;)
[07:30] heh
[07:30] anyway, it also has a simple automation script for scanning
[07:30] interactive CLI script
[07:30] you pick the device from a list, enter DPI, width, height
[07:30] hit enter, and it scans a page
[07:31] hit enter again, and it scans a page
[07:31] saving them as incrementing numbers
[07:31] and a separate script for re-scanning certain pages
[07:32] so, seems I just finished my first comic book scan: http://aarnist.cryto.net:81/straal2.pdf :D
[08:07] http://sphotos-a.xx.fbcdn.net/hphotos-ash3/46201_4497571931862_1789693667_n.jpg
[08:09] SketchCow: gay
[08:09] hahaha
[08:10] also, I *may* have an idea for an ultra-cheap camera-based book scanner... but I'll have to see if the camera I have in mind is suitable.
[08:10] so... searching through boxes it is
[08:24] interesting... I actually get pictures of reasonable quality with this camera
[08:26] after postprocessing: http://i.imgur.com/qtX9w.png
[08:29] I wonder what kind of pictures I could get from this camera with a bit of optimization
[08:52] that hurts my eyes to look at ¬_¬
[08:52] joepie91: I bet alignment would help too
[09:04] chronomex: problem is this is only 640 * 480
[09:04] and the focus isn't great
[09:04] oh
[09:04] because it obviously doesn't have autofocus
[09:04] this thing *should* have a photo mode that does 1280x1024 photos
[09:04] but it's behaving quite strangely
[09:04] it goes into photo mode, but when I press the button it'll still just make a video
[09:04] instead of taking a photo
[09:04] :|
[09:04] frustrating
[09:05] it's this camera: http://www.chucklohr.com/808/C3/index.html
[09:05] it's an awesome little camera otherwise but it's focused at far objects
[09:06] so doesn't cope with book text too well :P
[10:11] "hope it can help your life safe and happiness" - wut? :D
[11:25] SmileyG: that's a play on the messages from Chinese eBay sellers
[11:25] lol
[11:35] :D
[15:02] So, here we are deep into the WARC transfer of material, either my backhack conversions of previous projects, or the webshots upload.
[15:03] I'm now waiting to see if anyone yells about the loading of the data, or the system or anything.
[15:03] But looks like we have quite a lot to give them, and who knows.
[15:04] i uploaded 2 more linux format dvds this morning
[15:04] http://archive.org/details/cdrom-linuxformatmagazine-128
[15:05] http://archive.org/details/cdrom-linuxformatmagazine-136
[15:05] No need to tell me, godane. I'll get to you on my next sweep of you.
[15:05] ok
[15:05] i just feel better that my wifi is working again
[15:16] hahaha
[15:16] SketchCow: that would be awesome
[15:16] although
[15:16] I have not a lot of experience speaking
[15:28] I'd have coached you.
[15:36] <3
[17:30] :/ http://www.idigitaltimes.com/articles/12066/20121022/nbc-erases-snl-sketch-digital-archive-copyright.htm
[17:43] https://twitter.com/shaneb/status/261159783921483776
[17:44] balrog-: Non discussion
[18:24] balrog-: http://i.imgur.com/GVajj.png
[18:24] joepie91: yeah I noticed
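The chunked search-and-replace described at [07:26]-[07:28] (read the PDF in pieces, replace a byte string, and cope with a match that straddles two chunks) might look something like the sketch below. It is not the script from the pastie link: the name `replace_in_stream` is hypothetical, and the `b"OLD"`/`b"NEW"` strings are placeholders, since the log never says which string the fix actually replaces. Instead of reading ahead when a partial match is seen, this version holds back the last `len(needle) - 1` bytes of each buffer, which gives the same result while keeping memory use near one chunk.

```python
import io


def replace_in_stream(src, dst, needle, replacement, chunk_size=512 * 1024):
    """Copy src to dst, replacing needle with replacement, chunk by chunk.

    Returns the number of replacements. Only about chunk_size bytes are
    held in memory at any time, so large files are fine.
    """
    if not needle:
        raise ValueError("needle must not be empty")
    keep = len(needle) - 1  # longest tail that could start a split match
    buf = b""
    count = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pos = 0
        while True:
            idx = buf.find(needle, pos)
            if idx == -1:
                break
            dst.write(buf[pos:idx])
            dst.write(replacement)
            pos = idx + len(needle)
            count += 1
        # Hold back a tail that might be the start of a match that
        # continues in the next chunk; flush everything before it.
        tail_start = max(pos, len(buf) - keep)
        dst.write(buf[pos:tail_start])
        buf = buf[tail_start:]
    dst.write(buf)  # whatever is left cannot contain a full match
    return count


if __name__ == "__main__":
    # Placeholder data; a tiny chunk size forces matches across chunk borders.
    source = io.BytesIO(b"aaOLDbbOLDOLDcc")
    target = io.BytesIO()
    replace_in_stream(source, target, b"OLD", b"NEW", chunk_size=4)
    print(target.getvalue())
```

For a real file, `src` and `dst` would be files opened in binary mode (`"rb"` and `"wb"`), with the output written to a new path rather than over the input.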