Time |
Nickname |
Message |
01:46
🔗
|
joepie91 |
btw, SketchCow, I think you may find this useful for keeping track of things: http://www.treesheets.com/ |
01:47
🔗
|
joepie91 |
(may also be useful for others, and it runs natively on Linux as well) |
02:23
🔗
|
SketchCow |
Copied FORTUNECITY/com/meltingpot/com-meltingpot-research-20120405-005316.warc.gz to warc |
02:23
🔗
|
SketchCow |
alard: |
02:23
🔗
|
SketchCow |
Checking FORTUNECITY/com/meltingpot/com-meltingpot-gambia-20120401-144041.warc.gz |
02:23
🔗
|
SketchCow |
Could not decompress warc.gz. gunzip returned 2. |
02:23
🔗
|
SketchCow |
Copying FORTUNECITY/com/meltingpot/com-meltingpot-gambia-20120401-144041.warc.gz to tar |
02:23
🔗
|
SketchCow |
So that's good. |
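Editor's sketch: the log lines above suggest a checker that test-decompresses each `.warc.gz`, copying clean files to a "warc" destination and undecompressable ones to a "tar" destination. The actual script is not shown in the channel, so the function names and the routing rule below are assumptions inferred from the output.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard in chunks so large WARCs never sit in memory.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # Bad header, truncated stream, or corrupt compressed data.
        return False


def pick_destination(path):
    """Hypothetical routing rule: clean files go to 'warc', the rest to 'tar'."""
    return "warc" if gzip_is_readable(path) else "tar"
```

A truncated or non-gzip file ends up in the "tar" bucket, which matches the "Could not decompress" followed by "Copying ... to tar" sequence in the log.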
02:26
🔗
|
dashcloud |
did you see my note about the two Coming Soon items? |
02:30
🔗
|
SketchCow |
17:41 <@dashcloud> so reading the scrollback, I did a brief check of the items, and I came across Coming Soon, which has one item as WARCS, and there's a second item with a WARC file inside a zipfile |
02:30
🔗
|
SketchCow |
That, right? |
02:30
🔗
|
dashcloud |
yes |
02:30
🔗
|
SketchCow |
The thing 6 lines up? |
02:30
🔗
|
dashcloud |
sorry! |
02:30
🔗
|
SketchCow |
Or are you watching joins and parts? |
02:30
🔗
|
SketchCow |
Because I turned THAT shit off MONTHS ago. |
02:30
🔗
|
SketchCow |
I'd have gone insane. |
02:32
🔗
|
SketchCow |
http://archive.org/details/csoon-20111016 this one? |
02:32
🔗
|
SketchCow |
I see. |
02:32
🔗
|
SketchCow |
Yes, it's handled. The csoon-* is a WARC of the same |
02:32
🔗
|
SketchCow |
Good eye, though. |
02:33
🔗
|
dashcloud |
okay |
02:37
🔗
|
joepie91 |
ok, seriously, I love scantailor |
02:37
🔗
|
SketchCow |
scantailor fixes everything. |
02:37
🔗
|
joepie91 |
yes, pretty much |
02:37
🔗
|
joepie91 |
comics, books, it does all of it :o |
02:37
🔗
|
joepie91 |
and most of it automated |
02:37
🔗
|
joepie91 |
hell, it pretty much successfully cleaned up a book that was copied *on a typewriter* |
02:38
🔗
|
joepie91 |
on shitty spotty recycled paper |
02:38
🔗
|
SketchCow |
As my friend Dan Reetz likes to say, sometimes scantailor unwittingly fixes typesetting errors with books |
02:38
🔗
|
joepie91 |
heh |
02:38
🔗
|
SketchCow |
Where the plates were off by a millimeter or so |
02:38
🔗
|
joepie91 |
SketchCow: http://aarnist.cryto.net:81/vrijheid2.pdf |
02:38
🔗
|
joepie91 |
is the result |
02:38
🔗
|
chronomex |
oh yeah I've had books come out less crooked than the original |
02:38
🔗
|
joepie91 |
two pages are missing and I should rescan some pages because they were too fuzzy |
02:38
🔗
|
joepie91 |
but overall it's VERY nice |
02:39
🔗
|
joepie91 |
also, tiff2pdf somehow fucked up the front cover, not sure why :P |
02:39
🔗
|
chronomex |
that is a nice scan. |
02:40
🔗
|
joepie91 |
yes, yes it is :) |
02:40
🔗
|
joepie91 |
but yeah, a few pages definitely needs fixing |
02:40
🔗
|
joepie91 |
need * |
02:42
🔗
|
joepie91 |
109, for example, is a bit meh |
02:43
🔗
|
balrog- |
joepie91: tiff2pdf is picky about input tiff |
02:43
🔗
|
balrog- |
very, very picky |
02:47
🔗
|
joepie91 |
yes, so I've noticed |
02:47
🔗
|
joepie91 |
I suspect there's some color space fuckup or something |
02:48
🔗
|
joepie91 |
what I have noticed that has somewhat surprised me: it's possible to make scans of professional quality on Linux with free software alone |
02:48
🔗
|
joepie91 |
from scan to postprocessed PDF |
02:48
🔗
|
joepie91 |
and reasonably automate-able |
02:48
🔗
|
balrog- |
joepie91: if you or someone is willing to fix hocr2pdf or write a working alternative, then you can have OCRed too |
02:49
🔗
|
balrog- |
tesseract produces decent output |
02:49
🔗
|
joepie91 |
what language is it written in? |
02:49
🔗
|
balrog- |
C |
02:49
🔗
|
joepie91 |
ah, not my thing |
02:49
🔗
|
joepie91 |
though |
02:49
🔗
|
joepie91 |
I may know someone who can do that |
02:49
🔗
|
balrog- |
but there's hOCR-handling stuff in ruby and iirc in python |
02:49
🔗
|
chronomex |
ocropus too |
02:49
🔗
|
joepie91 |
will give him a poke :P |
02:49
🔗
|
joepie91 |
right |
02:49
🔗
|
balrog- |
does ocropus handle hOCR? |
02:49
🔗
|
chronomex |
idk |
02:49
🔗
|
balrog- |
the OCR step is mostly good |
02:49
🔗
|
balrog- |
the tricky part is putting the hOCR into the PDF |
02:49
🔗
|
joepie91 |
speaking of which, a potential nice archiveteam-project: build a fast book scanner with fully automated software 'chain' from scan/photo to OCRed ebook files |
02:49
🔗
|
joepie91 |
make it publicly accessible |
02:49
🔗
|
joepie91 |
"come turn your book into an ebook here for free" |
02:49
🔗
|
balrog- |
hOCR is the OCRed text in HTML format with tags indicating the location |
02:50
🔗
|
joepie91 |
and at the same time, archive/catalogue the scanned books |
02:50
🔗
|
chronomex |
http://en.wikipedia.org/wiki/HOCR says yes, ocropus and tesseract both |
02:50
🔗
|
balrog- |
that's software that OUTPUTS it |
02:50
🔗
|
joepie91 |
basically, IRL archiveteam project |
02:50
🔗
|
balrog- |
you need something to input it and stuff it into a PDF |
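Editor's sketch: as described above, hOCR is HTML whose elements carry location data, typically as a `bbox x0 y0 x1 y1` property in the `title` attribute. A tool that puts the text into a PDF would first have to read those boxes back. The following is a minimal reader using only the standard library; the class and function names are made up for illustration, it only looks at `ocrx_word` spans, and it assumes word spans contain no nested spans.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title such as 'bbox 1 2 3 4; x_wconf 90'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs for every span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_html):
    parser = HocrWords()
    parser.feed(hocr_html)
    return parser.words
```

Each returned pair gives the text plus its pixel rectangle on the page image. Drawing that text invisibly at the right place in the PDF is the step the channel calls the tricky part, and it is not attempted here.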
02:50
🔗
|
chronomex |
yep |
02:50
🔗
|
joepie91 |
balrog-: I'll have a look at it some time soon |
02:50
🔗
|
balrog- |
http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/ |
02:51
🔗
|
balrog- |
svn.exactcode.de for the code |
02:51
🔗
|
joepie91 |
ok :) |
02:51
🔗
|
joepie91 |
but yeah, balrog-, chronomex, thoughts on IRL bookscanning project? |
02:51
🔗
|
balrog- |
well, I'd first need a bookscanner |
02:51
🔗
|
balrog- |
problem is, you don't want to know how many books I have. |
02:52
🔗
|
chronomex |
you have a lot |
02:52
🔗
|
chronomex |
got it |
02:52
🔗
|
joepie91 |
well |
02:52
🔗
|
joepie91 |
idk if I pasted this, but I ran across a video of a bookscanner |
02:52
🔗
|
joepie91 |
that would do the job |
02:52
🔗
|
joepie91 |
and I think it should be fairly inexpensive to build |
02:52
🔗
|
chronomex |
the automatic one with the wedge? |
02:52
🔗
|
chronomex |
yeah that's cool |
02:52
🔗
|
joepie91 |
yeah |
02:52
🔗
|
chronomex |
dunno about getting the sensors right down at the tip tho |
02:52
🔗
|
joepie91 |
all you need is basically a strong servo, a compressor, and two scanner units |
02:52
🔗
|
joepie91 |
(I think) |
02:52
🔗
|
chronomex |
s/servo/stepper/ |
02:52
🔗
|
joepie91 |
I suck at terms |
02:52
🔗
|
joepie91 |
stepper, right |
02:53
🔗
|
joepie91 |
terminology* |
02:53
🔗
|
joepie91 |
... wow, that was a self-proving statement lol |
02:53
🔗
|
chronomex |
you need + and - air |
02:53
🔗
|
joepie91 |
right, I know some people here that can probably do that |
02:53
🔗
|
joepie91 |
and they probably have the parts for it, too |
02:54
🔗
|
joepie91 |
but yeah, it would be sort of epic to just have a book scanner somewhere in a public space, where anyone can scan a book and get the resulting ebook emailed to him |
02:54
🔗
|
joepie91 |
and at the same time have the source files and postprocessed files archived centrally |
02:54
🔗
|
joepie91 |
and judging from the software that is available, that should be fairly easy to automate |
02:55
🔗
|
joepie91 |
but then a camera setup would probably be best |
02:55
🔗
|
joepie91 |
for starters |
02:55
🔗
|
joepie91 |
since the wedge thing is a bit.. large :P |
02:56
🔗
|
chronomex |
yea |
02:56
🔗
|
joepie91 |
and while the camera bookscanner can run off some kind of battery, that will be tricky for the wedge model |
02:56
🔗
|
joepie91 |
I mean, you could just put the camera bookscanner somewhere outside a mall temporarily |
02:57
🔗
|
joepie91 |
and run it off a battery and local storage |
03:01
🔗
|
chronomex |
have it spit out usb sticks or something |
03:01
🔗
|
chronomex |
"insert usb stick or sd card to receive a pdf!" |
03:01
🔗
|
joepie91 |
possible as well |
03:01
🔗
|
joepie91 |
maybe offer both USB and SD for instant ebook |
03:02
🔗
|
joepie91 |
or "give your email and we'll send it at the end of the day" as alternative |
03:02
🔗
|
joepie91 |
since USB sticks and SD cards tend to get lost :P |
03:03
🔗
|
joepie91 |
combine a custom python script using python-imaging-sane or whatever is needed to take webcam pictures (depending on setup) |
03:03
🔗
|
joepie91 |
with postprocessing via scantailor-cli |
03:04
🔗
|
joepie91 |
then tiffcp and tiff2pdf |
03:04
🔗
|
joepie91 |
and optionally calibre to produce a .mobi and .epub |
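Editor's sketch: the chain described above (postprocessed TIFFs, then tiffcp, then tiff2pdf, then optionally calibre) can be written as a list of commands. This only builds the command lines. The scanning step and the scantailor-cli step are left out because their options are not given in the log. The function names, the `merged.tif` file name, and the use of calibre's `ebook-convert` tool for the last step are assumptions.

```python
import subprocess
from pathlib import Path


def build_commands(page_tiffs, work_dir, pdf_path, ebook_formats=()):
    """Return the command lines for TIFF pages -> multipage TIFF -> PDF -> ebooks."""
    merged = Path(work_dir) / "merged.tif"  # hypothetical intermediate file
    commands = [
        # tiffcp concatenates the single-page TIFFs into one multipage TIFF.
        ["tiffcp", *[str(p) for p in page_tiffs], str(merged)],
        # tiff2pdf wraps the multipage TIFF in a PDF written to the -o path.
        ["tiff2pdf", "-o", str(pdf_path), str(merged)],
    ]
    for fmt in ebook_formats:
        # calibre's converter picks the output format from the file extension.
        target = Path(pdf_path).with_suffix("." + fmt)
        commands.append(["ebook-convert", str(pdf_path), str(target)])
    return commands


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Keeping the command construction separate from execution makes it easy to print the commands for a dry run before pointing them at a real book.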
03:10
🔗
|
chronomex |
would we trust the user to metadata |
03:11
🔗
|
chronomex |
I don't trust anyone to metadata unless they're 1) a librarian, 2) super picky, or 3) me |
03:11
🔗
|
chronomex |
I suppose 2 is redundant |
03:11
🔗
|
joepie91 |
I'd say, let the user give metadata first |
03:11
🔗
|
joepie91 |
then review before final archival |
03:11
🔗
|
chronomex |
aye |
03:11
🔗
|
joepie91 |
at the end of the day |
03:11
🔗
|
joepie91 |
you have to review anyway |
03:11
🔗
|
joepie91 |
to get rid of any personal markings |
03:11
🔗
|
joepie91 |
owner names, stamps, etc |
03:11
🔗
|
chronomex |
yeah proofing metadata against a title page is pretty straightforward |
03:11
🔗
|
chronomex |
no |
03:11
🔗
|
chronomex |
leave that in |
03:12
🔗
|
joepie91 |
that'll cause an issue for people |
03:12
🔗
|
chronomex |
hm? |
03:12
🔗
|
joepie91 |
I doubt they'd want their name associated with a scan |
03:12
🔗
|
chronomex |
oh |
03:12
🔗
|
chronomex |
tell them not to scan the bookplate then? |
03:12
🔗
|
joepie91 |
that's no use when scanning is automated :P |
03:12
🔗
|
chronomex |
oh |
03:12
🔗
|
joepie91 |
most people write their name in the inside |
03:12
🔗
|
chronomex |
ummmmm |
03:12
🔗
|
* |
chronomex shrugs |
03:12
🔗
|
chronomex |
I hadn't considered that |
03:12
🔗
|
joepie91 |
you can just blank that out, it's typically not written over any actual book content |
03:13
🔗
|
chronomex |
true |
03:13
🔗
|
joepie91 |
same for stamps, they're usually on the inside cover |
03:13
🔗
|
joepie91 |
in the blank area |
03:13
🔗
|
chronomex |
you could offer the scanning person an option to do that themselves |
03:13
🔗
|
joepie91 |
true |
03:13
🔗
|
joepie91 |
but you have to be careful to not introduce too many variables and options |
03:14
🔗
|
joepie91 |
or the whole appeal of an "ebookify your book here" machine will be gone |
03:14
🔗
|
joepie91 |
it's a tricky thing to average :P |
03:15
🔗
|
chronomex |
yes |
03:18
🔗
|
joepie91 |
good point: if it requires manual pageturning, people won't do it |
03:34
🔗
|
SketchCow |
Tried to get one of you guys a keynote for a conference. |
03:34
🔗
|
SketchCow |
underscor or Chronomex, probably |
03:34
🔗
|
SketchCow |
They wouldn't go for it |
03:35
🔗
|
SketchCow |
Mostly because of the way the place works (they vote on the person, not the organization) |
03:35
🔗
|
SketchCow |
But I tried! |
03:35
🔗
|
SketchCow |
underscor keynoting would be awwwweeessoommmmee |
03:35
🔗
|
SketchCow |
They'd not forget THAT |
05:54
🔗
|
chronomex |
hehe |
05:54
🔗
|
chronomex |
what organization was this? |
06:44
🔗
|
ersi |
ArchiveTeam for president! |
07:26
🔗
|
joepie91 |
balrog-, chronomex, good news! |
07:26
🔗
|
chronomex |
oh yeah? |
07:26
🔗
|
joepie91 |
I wrote a script to fix the tiff2pdf issue |
07:26
🔗
|
joepie91 |
with the discolored PDFs |
07:26
🔗
|
joepie91 |
http://pastie.org/5107570 |
07:26
🔗
|
chronomex |
rad |
07:26
🔗
|
joepie91 |
does a chunked read of a PDF |
07:26
🔗
|
joepie91 |
so it doesn't load all of it in memory at once |
07:27
🔗
|
joepie91 |
and replaces a certain string to fix the issue |
07:27
🔗
|
joepie91 |
and yes, it handles strings on the border between 2 chunks properly :P |
07:27
🔗
|
joepie91 |
if it detects part of the to-be-matched string existing at the end of a chunk |
07:27
🔗
|
joepie91 |
it'll read more to get the rest |
07:28
🔗
|
joepie91 |
so basically, it always loads at most 512 KB of data |
07:28
🔗
|
joepie91 |
which means it should be possible to easily process a 1GB PDF if needed |
07:28
🔗
|
joepie91 |
without running out of RAM |
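Editor's sketch: the linked script is not reproduced in the log, so this is a generic illustration of the same idea, a streaming search-and-replace that never holds more than about one chunk in memory. Instead of reading ahead when a partial match shows up at the end of a chunk, this version holds back the last `len(needle) - 1` bytes of each chunk and prepends them to the next one, which covers the same boundary case. The function name and the byte strings in the usage example are placeholders, not the actual string the script patches.

```python
import io


def replace_in_stream(src, dst, needle, replacement, chunk_size=512 * 1024):
    """Copy src to dst, replacing every occurrence of needle. Returns the count."""
    if not needle:
        raise ValueError("needle must not be empty")
    keep = len(needle) - 1  # longest possible partial match at a chunk end
    tail = b""
    replaced = 0
    while True:
        chunk = src.read(chunk_size)
        data = tail + chunk
        if not chunk:
            # End of input: the held-back tail is too short to contain a match.
            dst.write(data)
            return replaced
        out = []
        pos = 0
        while True:
            hit = data.find(needle, pos)
            if hit < 0:
                break
            out.append(data[pos:hit])
            out.append(replacement)
            pos = hit + len(needle)
            replaced += 1
        rest = data[pos:]
        # Hold back the bytes that could be the start of a match that
        # continues in the next chunk; write everything before them.
        cut = max(len(rest) - keep, 0)
        out.append(rest[:cut])
        tail = rest[cut:]
        dst.write(b"".join(out))
```

For example, with a deliberately tiny chunk size so that matches straddle chunk boundaries:

```python
src = io.BytesIO(b"aaXYZbbXYZ")
dst = io.BytesIO()
count = replace_in_stream(src, dst, b"XYZ", b"Q", chunk_size=4)
# count == 2 and dst.getvalue() == b"aaQbbQ"
```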
07:28
🔗
|
chronomex |
oboy |
07:28
🔗
|
joepie91 |
also, I tested it ofc, and it works |
07:29
🔗
|
joepie91 |
thanks to these guys: http://www.asmail.be/msg0055295176.html for the fix :P |
07:29
🔗
|
joepie91 |
I'll be releasing a few scripts for scanning soon anyway |
07:29
🔗
|
chronomex |
nice |
07:30
🔗
|
chronomex |
I let archive.org's deriver make my pdfs though ;) |
07:30
🔗
|
joepie91 |
heh |
07:30
🔗
|
joepie91 |
anyway, it also has a simple automation script for scanning |
07:30
🔗
|
joepie91 |
interactive CLI script |
07:30
🔗
|
joepie91 |
you pick the device from a list, enter DPI, width, height |
07:30
🔗
|
joepie91 |
hit enter, and it scans a page |
07:31
🔗
|
joepie91 |
hit enter again, and it scans a page |
07:31
🔗
|
joepie91 |
saving them as incrementing numbers |
07:31
🔗
|
joepie91 |
and a separate script for re-scanning certain pages |
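Editor's sketch: the workflow described above (press enter, scan a page, save under an incrementing number, with a second pass for re-scanning chosen pages) might be structured like this. The scanner itself is abstracted as a `scan_page(path)` callable, which in a real setup would wrap a SANE binding. That callable, the four-digit file naming, and the "q to quit" convention are all assumptions, and the device, DPI, and size prompts are omitted.

```python
from pathlib import Path


def page_path(out_dir, number, ext="tif"):
    """File name for a page: zero-padded so the files sort in page order."""
    return Path(out_dir) / f"{number:04d}.{ext}"


def scan_loop(scan_page, out_dir, start=1, prompt=input):
    """Scan one page per Enter press until the user types 'q'."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    number = start
    saved = []
    while True:
        answer = prompt(f"Enter to scan page {number}, q to quit: ")
        if answer.strip().lower() == "q":
            break
        path = page_path(out_dir, number)
        scan_page(path)  # hypothetical: acquires one image and writes it to path
        saved.append(path)
        number += 1
    return saved


def rescan(scan_page, out_dir, numbers):
    """Overwrite specific pages, e.g. ones that came out too fuzzy."""
    for number in numbers:
        scan_page(page_path(out_dir, number))
```

Passing the scanner and the prompt in as arguments keeps the loop testable without hardware attached.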
07:32
🔗
|
joepie91 |
so, seems I just finished my first comic book scan: http://aarnist.cryto.net:81/straal2.pdf :D |
08:07
🔗
|
SketchCow |
http://sphotos-a.xx.fbcdn.net/hphotos-ash3/46201_4497571931862_1789693667_n.jpg |
08:09
🔗
|
ersi |
SketchCow: gay |
08:09
🔗
|
joepie91 |
hahaha |
08:10
🔗
|
joepie91 |
also, I *may* have an idea for an ultra-cheap camera-based book scanner... but I'll have to see if the camera I have in mind is suitable. |
08:10
🔗
|
joepie91 |
so... searching through boxes it is |
08:24
🔗
|
joepie91 |
interesting... I actually get pictures of reasonable quality with this camera |
08:26
🔗
|
joepie91 |
after postprocessing: http://i.imgur.com/qtX9w.png |
08:29
🔗
|
joepie91 |
I wonder what kind of pictures I could get from this camera with a bit of optimization |
08:52
🔗
|
SmileyG |
that hurts my eyes to look at ¬_¬ |
08:52
🔗
|
chronomex |
joepie91: I bet alignment would help too |
09:04
🔗
|
joepie91 |
chronomex: problem is this is only 640 * 480 |
09:04
🔗
|
joepie91 |
and the focus isn't great |
09:04
🔗
|
chronomex |
oh |
09:04
🔗
|
joepie91 |
because it obviously doesn't have autofocus |
09:04
🔗
|
joepie91 |
this thing *should* have a photo mode that does 1280x1024 photos |
09:04
🔗
|
joepie91 |
but it's behaving quite strangely |
09:04
🔗
|
joepie91 |
it goes into photo mode, but when I press the button it'll still just make a video |
09:04
🔗
|
joepie91 |
instead of taking a photo |
09:04
🔗
|
joepie91 |
:| |
09:04
🔗
|
joepie91 |
frustrating |
09:05
🔗
|
joepie91 |
it's this camera: http://www.chucklohr.com/808/C3/index.html |
09:05
🔗
|
joepie91 |
it's an awesome little camera otherwise but it's focused at far objects |
09:06
🔗
|
joepie91 |
so doesn't cope with book text too well :P |
10:11
🔗
|
SmileyG |
"hope it can help your life safe and happiness" - wut? :D |
11:25
🔗
|
joepie91 |
SmileyG: that's a play on the messages from Chinese eBay sellers |
11:25
🔗
|
joepie91 |
lol |
11:35
🔗
|
SmileyG |
:D |
15:02
🔗
|
SketchCow |
So, here we are deep into the WARC transfer of material, either my backhack conversions of previous projects, or the webshots upload. |
15:03
🔗
|
SketchCow |
I'm now waiting to see if anyone yells about the loading of the data, or the system or anything. |
15:03
🔗
|
SketchCow |
But looks like we have quite a lot to give them, and who knows. |
15:04
🔗
|
godane |
i uploaded 2 more linux format dvds this morning |
15:04
🔗
|
godane |
http://archive.org/details/cdrom-linuxformatmagazine-128 |
15:05
🔗
|
godane |
http://archive.org/details/cdrom-linuxformatmagazine-136 |
15:05
🔗
|
SketchCow |
No need to tell me, godane. I'll get to you on my next sweep of you. |
15:05
🔗
|
godane |
ok |
15:05
🔗
|
godane |
i just feel better that my wifi is working again |
15:16
🔗
|
underscor |
hahaha |
15:16
🔗
|
underscor |
SketchCow: that would be awesome |
15:16
🔗
|
underscor |
although |
15:16
🔗
|
underscor |
I have not a lot of experience speaking |
15:28
🔗
|
SketchCow |
I'd have coached you. |
15:36
🔗
|
underscor |
<3 |
17:30
🔗
|
balrog- |
:/ http://www.idigitaltimes.com/articles/12066/20121022/nbc-erases-snl-sketch-digital-archive-copyright.htm |
17:43
🔗
|
SketchCow |
https://twitter.com/shaneb/status/261159783921483776 |
17:44
🔗
|
SketchCow |
balrog-: Non discussion |
18:24
🔗
|
joepie91 |
balrog-: http://i.imgur.com/GVajj.png |
18:24
🔗
|
balrog- |
joepie91: yeah I noticed |