[01:12] so, Monoprice is bringing IPS displays to the masses: http://www.monoprice.com/products/product.asp?c_id=109&cp_id=10909&cs_id=1090901&p_id=9579&seq=1&format=2
[01:15] same monitor on ebay from a thousand koreans for $50 less
[01:16] hmm - buy from eBay, or buy from a reputable vendor with a warranty?
[01:16] monoprice offers a warranty?
[01:16] you can chargeback with ebay too
[01:16] As you've come to expect from Monoprice, we stand behind our products and offer a full 1 year warranty, which is at least 3-4 times what is offered by other monitor manufacturers. Additionally, we are so confident of the quality of these displays that we are guaranteeing these monitors will have less than 5 dead pixels. If you can count 5 dead pixels anywhere on the screen, we'll give you a new one. By comparison,
[01:16] the industry standard, even for industry leaders like Apple and LG, is 10 dead pixels or even more.
[01:16] not bad
[01:16] one I bought had 0 dead/stuck pixels :P
[02:13] uploaded: https://archive.org/details/bitgamer-archive
[02:47] Moved to archiveteam
[05:13] starting botfeed for archivist
[05:30] First test works, time to run the automated ingestor.
[05:31] om nom nom
[05:35] 103 books, just for one MegaHAL
[05:39] 13:39:34 up 122 days, 3:45, 1 user, load average: 1.29, 1.10, 0.78
[05:46] i think we need something in wget so you can download only images from other hosts
[05:47] sort of --accept-regex-host or something
[05:48] this way when you mirror sites that have a lot of external images you can do a -H --accept-regex-host='(jpg|jpeg|gif|png)' or something
[05:48] i use underground gamer for example
[05:49] it has tons of images hosted on it but also a ton hosted on other websites
[16:45] interesting observation/argument coming up ... it seems most of the big "disk preservation" groups aren't interested in a large portion of what's out there
[16:46] SPS only wants games; redump and the like mainly focus on games; pretty much no one cares about cracked/"pirated" materials (stuff from the 80s and 90s and such, not current, but current should not be ignored either imho) - they only want original
[16:47] Where is it coming up?
[16:48] in #messdev and the private mame list
[16:48] balrog-: I'd argue that period of time is especially interesting with regards to cracked versions
[16:48] because of all the demos and such from the various groups
[16:48] cracktros
[16:48] etc
[16:49] yes, absolutely.
[16:49] I mean, the only ones I can recall now that still do cracktros
[16:49] are hoodlum and fff
[16:49] and even those, sparingly
[16:49] idk if hoodlum even still exists
[16:49] Spoiler: I've come to not like SPS all that much.
[16:49] I respect the technical effort and the commitment to data acquisition.
[16:50] yeah.. screw the millions of floppies with people's private data they might be interested in getting back. absolutely no importance there
[16:50] they flat out state they only want games. They won't even accept OS releases, and such
[16:50] SketchCow: i got a broadcast copy of the screen savers from 2003.07.14
[16:50] (in case it is missed: )
[16:50] very good copy
[16:51] there are other issues, but I don't want to get into those
[16:54] (regarding SPS)
[16:54] issues with MAME/MESS dumping and such ... I wouldn't mind discussing this more but I know SketchCow is extremely busy.
[16:56] thank god for u-g
[16:57] UG is a help, but the problem here is deeper :(
[16:57] I'm happy to discuss it, but yeah, I'm busy in a general sense.
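Re the wget idea at [05:46]: there is no --accept-regex-host, but GNU Wget 1.14 and later have --accept-regex, which matches the complete URL, so something close can probably be approximated by accepting everything on the tracker's own host plus image URLs on any other host. A rough sketch only; ug.example.org is a placeholder hostname:

    wget -r -H \
         --accept-regex='(^https?://ug\.example\.org/|\.(jpe?g|gif|png)$)' \
         http://ug.example.org/

Accepting the whole primary host is what lets the HTML pages, and therefore the recursion, through; the extension pattern will miss off-host images served with query strings.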
[16:57] we have efforts like the dumping union (another private group), but they only care about arcades
[16:57] but it's a discussion worth having, so go ahead. I'm getting a lot done in other windows
[16:57] which means people like myself end up shelling out hundreds of dollars on various equipment to dump and reverse-engineer
[16:59] I just brought up like 4 different issues :/
[16:59] SketchCow: fileplanetfileplanetfileplanetfileplanetfileplanet ;P
[17:00] i admire byuu's work for preserving
[17:00] this feels like a game of whack-a-mole, or the mythical hydra - fix one problem, three others appear.
[17:00] there's no way one or three people will be able to solve this
[17:01] Schbirid: Sorry about that - the slowdown is that I need to set aside a chunk of time to make sure the whole thing goes smoothly, because one mistake kills terabytes
[17:02] np, if i nag too much, just say
[17:03] yeah I feel the same
[17:04] the more annoying thing that I see is the costs of some of this stuff, which we end up paying anyway to preserve it
[17:04] now, imagine a world with a quickly expiring copyright. things would be so much easier
[17:04] and museums? I doubt most museums would be willing to do something like this: http://kevtris.org/Projects/votraxml1/
[17:05] (take a look at one of the Board pages)
[17:05] i am sure they would love to, but funding... :(
[17:05] For what it's worth, "there's no way one or three people will be able to solve this" fails to take into account that one of those people might be me.
[17:05] SketchCow: yes, yes this is true. I mean people like myself, not like you :)
[17:05] Schbirid: no, most museums would not apply heat to an artifact to remove epoxy potting in order to document how it works and repair it.
[17:06] SketchCow: keep your nerves in mind, you cannot save everything
[17:06] balrog-: aye, i spoke too soon
[17:06] Schbirid: remember, SketchCow is good at finding other people who are able to help more ;)
[17:06] heh
[17:06] and yeah - good luck finding a votrax ml-1. not very many were made to begin with
[17:07] http://archive.org/details/dragon_magazine
[17:07] Let's see who screams
[17:08] I think it would go a long way if something like dumping union could be organized for non-arcade hardware. Things get more tricky, because while many arcade board types are well understood, many computers need poking and probing
[17:09] I've said it before but my main beef with sps is lack of transparency
[17:09] lately I've been doing research/reverse-engineering of early digital synth hardware, and figuring out secret modes and "tricks" to dump various early protected chips.
[17:09] DFJustin: I'm not even talking about SPS in particular here.
[17:09] I don't like SPS, but that's beside the point.
[17:10] if all we do is talk about how we don't like SPS, we are missing the bigger picture
[17:11] http://archive.org/details/magazine_rack - watch that space
[17:11] it's about to get fucking huge
[17:12] :)
[17:13] http://www.crackajack.de/2013/01/09/vintage-man-machine-interface/
[17:15] stuff like that ... obtaining one, figuring out how it works, and writing decent emulation is not all that easy
[17:18] I suppose everyone here knows that though
[17:20] Last year, the first year I was working for Internet Archive, I was focused on several things. Among them was easier scripting to ingest massive amounts of data into the archive. For that I was rather successful - even outside of Archive Team-specific chicanery, I pulled in something like 100 terabytes of data.
[17:20] that's impressive... and quite important.
[17:20] And I found that it's getting very easy, not 100%, but much easier to absorb most of the folksonomic scans and digitizations people have done over the last decade or so.
[17:21] So that is ongoing. In this week's work, the integration of bitsavers will bring 25,000 computer documents into the world in an easy to browse fashion.
[17:21] bitsavers? nice. Be warned only the newer stuff there is OCRed... and as you probably know already, new stuff keeps getting added.
[17:21] there's also bitsavers/bits
[17:21] Tell me more
[17:21] Also, they often feature computers
[17:22] The functionality of what I've created is AUTOMATIC ingestion.
[17:22] It'll just run with each new addition of material.
[17:22] oh, cute: they have the code for XINU, explained in "Comer, Douglas E., Operating System Design: The Xinu Approach, Prentice-Hall" - I have that book
[17:22] It's a similar approach to http://archive.org/details/dnalounge
[17:23] That has no human intervention.
[17:23] IA OCRs and adds a text layer to anything that doesn't already have it
[17:23] and with good enough accuracy, right?
[17:23] fulltext OCR search is quite nice to have, in addition to the metadata
[17:23] OCR at archive is shit.
[17:23] See, you can't do this, balrog.
[17:23] This is how projects fail.
[17:23] I've found some pretty useful stuff because Google OCRs PDFs as they index them.
[17:24] yes, the OCR is not great, but in many cases it's good enough
[17:24] You get the foundations in the most non-intrusive decision making possible, and THEN you go "what about the curtains? do we have peonies or sunflowers in the front yard?"
[17:24] If you oscillate between "oh god, floppies are dying and SPS doesn't care" and "but what about the OCR accuracy", that's how you don't get stuff done.
[17:25] You get paralyzed.
[17:25] Move in waves.
[17:25] I'm not asking about OCR accuracy. I'm asking about OCR indexing
[17:25] You're asking about something above "get it all online"
[17:25] even google's crappy OCR is useful because it goes in an index that can be fulltext searched.
[17:25] Which is the first problem.
[17:26] this may be a good time to mention I just got ham radio magazine 1985-1986 from emule, do we have that already
[17:26] yes, it was made dark.
[17:27] balrog-: I think what SketchCow is trying to explain, is that if you start trying to create additional functionality at this point, you are leaving the problem before it partially unsolved (getting everything available in the first place)
[17:27] yes, that is true
[17:27] and losing focus and manpower to solve that problem
[17:27] Right now, in my house, I have negotiated for, and I have, a $25,000 Scribe digitizer from Internet Archive.
[17:27] It's in the other room, I've been setting it up.
[17:28] My official name in their system is Internet Archive Poughkeepsie
[17:28] You realize what this means.
[17:28] Pro-level digitization is now not subject to justification for computer documents.
[17:28] there's a place where volunteers can scan manuals and other items.
[17:28] It's right here.
[17:29] but you have to be very careful when getting everything, to not miss important things. this is more of an issue with hardware than software. Plenty of arcade boards and other boards had ROMs mis-dumped or certain ones not dumped at all in the past, and with rare prototypes, collectors are rarely willing to allow anyone to touch them once they have them.
[17:29] so if I have time, I can drag my paper documents there and scan them?
[17:29] Yes.
[17:29] manuals, schematics, etc
[17:29] the Scribe is designed for bound books, correct?
[17:30] Books that open with a side bound, yes
[17:30] but for stuff that can easily be unbound or that I'm willing to unbind, I'm better off doing so and sheetfeeding...?
[17:31] another question I've had for a while, and this is not specifically for SketchCow: has anyone done work on postprocessing color scans?
[17:32] balrog-: postprocessing in what sense?
[17:33] I've been scanning a few comics and running them through scan tailor, which went fine
[17:33] needed some manual adjustments, but otherwise it was great
[17:33] taking the multi-gb scans and compressing them down into something that doesn't take so much space yet has sufficient quality
[17:33] btw the folkscanomy collection should be linked off http://archive.org/details/additional_collections so people can find it
[17:33] hmmm
[17:33] bilevel you can use G4 fax compression which is great
[17:33] what format are the scans in?
[17:33] usually uncompressed or lzw or zip tiff
[17:33] because I'd imagine this would be something for imagemagick or similar
[17:33] some form of tiff, basically
[17:34] yeah I have been using imagemagick but color is just a pain
[17:34] For bilevel PDF I usually prefer jbig2; it's typically much much smaller, even lossless!
[17:34] how come..?
[17:34] does pdf support jbig2?
[17:34] huh it does.
[17:34] Yeah, it has since a few versions back. Most readers now support it too.
[17:35] also, balrog-... for lossless color scans you're probably looking at something like PNG... for lossy, I'm not sure
[17:35] but lossless scans will be HUGE still, probably
[17:35] joepie91: tiff+zip and png are rather identical
[17:35] really?
[17:35] yes, because png uses deflate
[17:35] weird, I thought I saw different results
[17:35] tiff+zip isn't tiff that's zipped, it's tiff with deflate
[17:36] For some reason reminds me, TIFF-LZW for 16 bit per channel scans has hilarious results.
[17:36] tiff-lzw is ... weird
[17:36] joepie91: what tool do you use to create jbig2 pdfs?
[17:36] is there a pdf2pdf that can recompress pdfs to jbig2?
[17:36] you sure you wanted to address me and not mistym? :P
[17:36] as I don't think I ever used jbig2
[17:36] err, mistym ^^
[17:36] :D
[17:36] heh
[17:37] also, this may come in handy
[17:37] tiff2pdf WILL fuck up your PDFs when JPG compression is used
[17:37] I have a fixing script for that
[17:37] for colour just use a high quality jpg, yes the file size will be big but who cares, better than having to rescan everything later when everyone has PB hard drives because the quality was shit
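A minimal sketch of the two compression routes discussed above, using ImageMagick since that is what is already in play here: CCITT Group 4 for pages that are already binarized (for example out of Scan Tailor), and high-quality JPEG for colour pages. The file names and the quality value of 92 are arbitrary placeholders, not anyone's recommendation:

    # already-binarized page: recompress the TIFF with CCITT Group 4
    convert page-bw-0001.tif -compress Group4 page-bw-0001-g4.tif
    # colour page: high-quality JPEG; 92 is a guess, tune to taste
    convert page-colour-0001.tif -quality 92 page-colour-0001.jpg

For a whole directory of colour pages, mogrify -format jpg -quality 92 *.tif does the same thing in batch.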
[17:37] balrog-: http://git.cryto.net/cgit/scantools/tree/fix-pdf
[17:38] can you get tiff2pdf patched maybe?
[17:38] or even better jp2
[17:38] (upstream)
[17:38] if you use tiff2pdf and it results in inverted JPGs in your PDF, then run that
[17:38] joepie91: I get inverted tiffs in my pdf with tiff2pdf
[17:38] it will comfortably handle multi-GB PDFs with very little memory, because it does chunked reads and writes
[17:38] balrog-: You could probably script it pretty easily for PDFs containing a single image layer by extracting to a sequence of TIFFs, then throwing that into jbig2enc, then tossing the results back into a PDF.
[17:38] also I have no idea how to fix it upstream
[17:38] but it's a known bug
[17:38] been having to use tiffcrop -I to correct :(
[17:39] mistym: yeah, but what tool to reassemble?
[17:39] anyway, balrog-, run your PDFs through that script and they will magically be fixed
[17:39] joepie91: ok
[17:39] had to write it to fix up my comic scans
[17:40] oh so that replaces ColorTransform 0 with ColorTransform 1 in some places?
[17:40] balrog-: jbig2enc includes a pdf.py script that turns the raw jbig2 into PDFs. If you do it on a page-by-page basis, it's then not too hard to combine the individual pages back into one PDF. Or (if you don't mind increasing the system requirements for reading) you can do the whole jbig2 compression using a single dictionary across multiple pages.
[17:40] balrog-: yes, but it does a chunked search and replace
[17:40] so it reads in small chunks
[17:40] so it doesn't have to load the entire PDF into RAM at once
[17:40] ah, yeah.
[17:40] and even handles edge cases
[17:41] where the search string is over the edge of two chunks
[17:41] ugh, all this needs to be on a wiki
[17:41] and *even* handles false positives immediately followed by a match :P
[17:41] Yeah, I should write this down...
[17:41] so all edge cases should be covered
[18:04] wow, jbig2...
[18:04] mistym: slow to view though :p
[18:05] balrog-: How slow it is depends on how big a dictionary you created :b One dictionary per page is pretty speedy. One dictionary per 100 pages (or more) is slow.
[18:05] Just went past 10,000 documents on http://archive.org/search.php?query=collection%3Abitsavers&sort=-publicdate
[18:06] Also, IT BEGINS
[18:06] http://archive.org/search.php?query=collection%3Amagazine_rack&sort=-publicdate
[18:07] mistym: I used the method in the readme: $ jbig2 -s -p -v *.jpg && pdf.py output >out.pdf
[18:07] created many .NNNN files and one .sym file
[18:08] That'd be one dictionary for the entire document, then. More efficient compression-wise, but the bigger the dictionary the slower decoding will be.
[18:08] yeah, and for a 386-page document...
[18:08] how do you do one dict per page or one per 25 pages?
[18:08] http://archive.org/details/texwiller_magazine (622 issues)
[18:09] :/
[18:09] oh btw, you might want to make it clearer what it means when books only appear as encrypted DAISY. A friend of mine was confused by that.
[18:10] What do you mean, what it means.
[18:13] some kind of auto-generated blurb saying "This book has been scanned by IA but is still under copyright so it is not available to read unless you have perceptual disabilities and have registered with such-and-such US government program" would be nice
[18:13] yes, that.
[18:14] or I get messages from friends as follows: http://pastebin.com/zFk1xrWG
[18:15] the musicbrainz cover art collection is also confusing people, understandable when you look at the page title that shows up on google https://archive.org/details/mbid-8a51ac29-77a4-4d25-9f75-8efcc25b0c33
[18:16] that's less of an issue, since it says "Cover Art Collection"
[18:16] I'm not looking to appeal to the lowest common denominator of people
[18:16] yeah but it's basically spamming google with thousands of album titles + "free download & streaming"
[18:16] err, Cover Art Archive rather
[18:17] but yeah should be obvious once you arrive
[18:17] the DAISY thing is not so obvious, and therein lies the issue
[18:25] people are stupid
[18:26] yes, but we don't want to cater to each and every stupid person
[18:26] Coderjoe: SPOILER ALERT
[18:27] balrog-: at what point did I say I wanted to?
[18:27] no, I'm just saying it's not worth it.
[18:27] mistym: answer? :)
[18:28] I'd like to know how to do this without making a pdf that crashes most viewers, even the most efficient
[18:28] balrog-: Sorry, lost track of this.
[18:28] it's ok ;)
[18:31] Anyway: rather than one invocation of jbig2, either do one per page, or slice your set of images into groups of however many.
[18:31] jbig2enc will use one dictionary for all of the input files you give it.
[18:31] will pdf.py be able to assemble multiple sets?
[18:32] or do I have to then merge pdfs?
[18:33] You'll need to merge them. I *think* pdfbeads may have a feature to do this for you; let me check.
[18:34] I don't like pdfbeads because it breaks things into lossy backgrounds anyway :/
[18:34] pdftk apparently can
[18:34] and pdfunite
[18:36] Didn't realize pdfbeads always forced lossy backgrounds. That sucks.
[18:36] It does provide a --pages-per-dict option though, which is what I meant.
[18:36] ahh.
[18:37] and yeah, lossy backgrounds do suck
[18:37] It does that even if you only provide it one layer?
[18:37] maybe that's not mandatory but I don't want to split into backgrounds *at all*
[18:38] What files are you feeding into it?
[18:38] into what, pdfbeads?
[18:38] it's been a few months since I've used it
[18:39] Yeah, I was just wondering about your input files. I haven't peeked at the code, but the help text implies that it doesn't always attempt to split into multiple layers. If the input file is already binarized, maybe it produces only a single layer?
[18:39] they're single-layer tiffs
[18:48] Anyway, I guess joining the multiple PDFs later with another tool is probably just as easy.
[18:48] yeah. ok
[18:50] I tried feeding already bitonalized data into pdfbeads and it seems to be hanging forever, so boo to it.
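A sketch of the "slice into groups" approach described above, sticking to the jbig2 -s -p -v ... && pdf.py output workflow quoted from the readme and merging with pdfunite (pdftk would do as well). The batch size of 25, the page-*.tif naming, and the assumption that the encoder leaves each batch's page files and symbol dictionary under the default "output" basename (the .NNNN files plus the .sym file mentioned earlier) are all assumptions; filenames with spaces would need more care:

    # one jbig2 symbol dictionary per 25 pages; smaller batches decode faster
    ls page-*.tif | split -l 25 - batch-
    n=0
    for b in batch-*; do
        jbig2 -s -p -v $(cat "$b") && pdf.py output > part-$(printf '%03d' "$n").pdf
        rm -f output.*        # clear this batch's page files and symbol dictionary
        n=$((n+1))
    done
    # stitch the per-batch PDFs back into a single document
    pdfunite part-*.pdf combined.pdf

One dictionary per page would be the same loop with split -l 1.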
[21:05] http://www.staples.com/VuPoint-Magic-Wand-Portable-Scanner-Black/product_900544
[21:06] i plan on buying that on ebay
[21:06] i can get a used one for like less than $10
[21:10] i bid on this: http://www.ebay.com/itm/VuPoint-Magic-Wand-PDS-ST415-VP-Handheld-Scanner-/330855555020?pt=US_Scanners&hash=item4d08871fcc&autorefresh=true
[21:11] same one that's on staples but for 90% off
[21:11] they might suck badly
[21:12] we will see
[21:12] i want to scan the pages and not have crappy flip cam snapshots
[21:20] Schbirid: have you used one of those scanners before?
[21:20] nope
[21:21] ok
[21:21] was hoping for a sample scan
[21:21] i only have a REALLY crappy normal scanner
[21:23] I know they used to suck but I haven't even seen one since the mid 90s
[21:23] so who knows
[21:23] used flatbeds are mad cheap though so I'm not sure what the point is
[21:26] easier to scan books i think
[21:26] you do not have to stress the back binding
[21:28] nighty
[21:38] http://archive.org/details/magazine_rack
[21:44] missing an L on "e Scienze is an Italian science magazine"
[21:46] Fixed
[23:40] This is the kind of crazy I can get behind: http://www.wired.com/threatlevel/2013/01/corporation-carpool-flap/ Guy rides in the carpool lane with his papers of incorporation - the paper is the corp, and the corp is a person, so he's got two people in the car