[00:16] alard: i've been given the green light to push proxy code into warctools.
[00:39] shaqfu: Probably one of those things that disappeared with Versiontracker or whatever :/
[00:40] mistym: Hopefully not. Going to wait until Sunday (2 wks since I emailed) and try his work address
[00:41] Since I'm not aware of any other checksum software that'll gracefully handle MacOS files
[00:44] I guess you already looked on cpan...?
[00:47] Hm, didn't think to. Trying now
[00:48] Nope
[00:49] Darn. Would have been nice to have been that easy!
[00:50] Dealing with Mac files can be funny. Once dealt with an open source project that only supported using a set of Mac resource files if you obtained them on a non-Mac platform. If you had the Mac files on a Mac computer you were out of luck.
[00:50] Weird - I guess it couldn't handle the forks?
[00:51] Yeah, that was the problem. Only way it handled the forks was in a special MacBinary file format which was invented by a Windows tool in order to keep the forks together on non-Mac filesystems.
[00:52] mistym: And now you know why I want something that handles forks natively :)
[00:52] Exactly ;)
[00:58] http://blog.greenpirate.org/hackitat-a-film-about-political-hacking/
[00:58] this oughter be a good'n
[01:40] Oh, neat - might have a shot at "20-30 boxes worth" of old computer mags
[01:40] Asking for titles/ranges now
[07:43] hm I think I've exhausted ia601206's disk
[07:43] http://www.us.archive.org/log_show.php?task_id=117980926 http://www.us.archive.org/log_show.php?task_id=117982231 http://www.us.archive.org/log_show.php?task_id=117985215 http://www.us.archive.org/log_show.php?task_id=117988133
[07:45] no, must be something else
[07:50] waiting for kisk fix
[07:50] disk
[07:54] mmm
[07:54] not too often you see /dev/sdal1 (SDAL1)
[07:56] Nemo_bis: ia701206 is down, which is why that happened
[07:57] ok
[08:03] Nemo_bis: you broke it!
http://i.imgur.com/7Bs6Z.png
[08:03] :D
[08:03] it'll probably be down until tomorrow unless one of the ops guys happens to wake up and see the message
[08:05] imagine all the IA boxes with messages like that, but black background and saying: DATA LOST
[08:05] haha
[08:06] and this playing in the background http://www.freesound.org/people/murcielago123/sounds/81459/
[08:07] even better
[08:07] http://www.freesound.org/people/jleedent/sounds/82392/
[08:08] that is the music + ia members shouting and running
[08:08] http://www.freesound.org/people/DaveCarter/sounds/109139/
[08:30] SketchCow: your follow up is fucking awesome
[08:31] I even shaved
[08:32] :-O!
[08:45] hmmm
[08:45] love it when one of our devs randomly decides to reboot a server
[08:45] and billions of nagios warnings go off
[08:45] people going "woaaah???"
[08:53] follow up?
[08:54] oh wow he actually put up the KS
[09:10] http://www.bytecellar.com/photo_pano.html
[09:20] i should try and see if my atari 520st still works
[09:21] my at&t pc6300 is long dead and shot up
[10:40] I just have to admit that I always got something out of watching data getting transferred, watching data flow into the archive brings back memories from modem days...
[12:31] hmm
[12:31] supposedly linux 0.02 through 0.10 source code has been lost
[12:31] I wonder if it's hiding someplace
[12:34] balrog_: i dont think so http://archive.org/details/git-history-of-linux
[12:37] wow, http://torrentfreak.com/demonoid-raid-credited-to-ifpi-multiple-arrests-in-mexico-reported-120809/
[12:39] fuck torrentfreak, they are such a bad piece of hyping tabloid rubbish money making poop machine
[12:39] http://www.ifpi.org/content/section_news/20120809.html
[12:39] rarely linking to their sources
[12:40] *proud to have rejected their "offer" to write for them years ago*
[12:40] gawker should buy them
[12:40] rant over
[12:40] yahoo
[12:40] haha, yes
[12:40] that would be a great day. yahoo buying gawker and huffpo
[12:42] lol, why?
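For context on the fork-aware checksumming discussed earlier (00:41-00:52): on modern macOS a file's resource fork is exposed through the special path `file/..namedfork/rsrc`, so a tool that hashes both forks natively can be sketched in a few lines. This is a hypothetical illustration under that assumption, not the missing tool being searched for:

```python
import hashlib
import os

def fork_checksums(path):
    """Return SHA-1 hex digests for a file's data fork and, when present
    (macOS on HFS+/APFS), its resource fork, which the kernel exposes at
    the special path <file>/..namedfork/rsrc. Hypothetical sketch."""
    def sha1_of(p):
        h = hashlib.sha1()
        with open(p, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                h.update(chunk)
        return h.hexdigest()

    sums = {'data': sha1_of(path)}
    rsrc = os.path.join(path, '..namedfork', 'rsrc')
    if os.path.exists(rsrc):  # absent on non-Mac filesystems or forkless files
        sums['rsrc'] = sha1_of(rsrc)
    return sums
```

On non-Mac platforms the resource-fork path simply doesn't exist, so the function degrades to an ordinary data-fork checksum, which is exactly the MacBinary-era problem the channel was complaining about.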
[12:45] because yahoo kills things
[12:47] Use #archiveteam-bs dudes
[12:53] i may do a warc.gz of ifpi.org now
[12:53] since a lot of pdfs are on there
[13:01] part of me wonders how big ifpi.org is to download
[13:01] i will not visit you in guantanamo!
[13:02] cause i'm mirroring their site?
[13:05] looks like the wayback machine has spotty archives of ifpi.org
[13:06] 2005 seems to have been the year it was archived the most
[14:40] I wonder if someone will answer: http://archive.org/post/426995/finereader-11
[14:43] i think that the best OCR solution for IA books is software like this http://beta.fromthepage.com/display/display_page?ol=w_rw_p_pl&page_id=378
[14:43] crowdsourcing OCR
[14:44] yeah
[14:49] it's a horrible waste of time if the starting OCR is not good
[14:49] you do archive the non-ocr files right?
[14:49] just thinking that way you can re-ocr the files again
[14:50] another idea, Internet Archive reCAPTCHA
[14:51] emijrp: https://www.mediawiki.org/wiki/Requests_for_comment/CAPTCHA
[14:52] have you traveled to any wikimania?
[14:52] heh, those Image CAPTCHAs are broken already
[14:54] emijrp: a couple
[14:55] uploaded: http://archive.org/details/ifpi.org-20120810-mirror
[15:00] uploading more gbtv while downloading techcrunch
[16:01] OK, small challenge.
[16:02] Small.
[16:02] http://fos.textfiles.com/web.me.com-shindelltravels.warc.gz
[16:02] give me the files inside it
[16:02] I'm fucking sick of fucking with .warc.gz
[16:03] SketchCow: thats why i always upload a .tar.gz of my mirror dumps
[16:03] and warc.gz for the archive wayback machine
[16:04] Thanks.
[16:04] You have given me the opposite of what I asked for.
[16:04] I asked for the files within it
[16:05] Not a story of how it could have been done differently
[16:05] i get a 404 error from your link
[16:05] SketchCow: http://warctozip.herokuapp.com/
[16:12] Thank youuuuuu, Deewiant
[16:18] Nemo_bis: It's not likely to happen soon.
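The warctozip service linked above does the "give me the files inside it" step server-side. The same record walk can be sketched locally with only the standard library, since a WARC file is a series of records (header block, blank line, `Content-Length` bytes of body, blank separator lines) and a `.warc.gz` is just that stream gzipped. This is a simplified sketch assuming well-formed WARC/1.0 records; real tools like warctools or warctozip handle many more edge cases:

```python
import gzip

def iter_warc_records(fileobj):
    """Yield (headers_dict, body_bytes) for each record in a WARC stream.
    Minimal sketch: assumes well-formed WARC/1.0 records."""
    while True:
        line = fileobj.readline()
        if not line:
            return
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b'WARC/'):
            raise ValueError('not a WARC record: %r' % line[:20])
        headers = {}
        while True:
            hline = fileobj.readline().rstrip(b'\r\n')
            if not hline:
                break  # blank line ends the header block
            key, _, value = hline.partition(b':')
            headers[key.decode().strip()] = value.decode().strip()
        body = fileobj.read(int(headers['Content-Length']))
        yield headers, body

def extract_responses(path):
    """Return {target URI: raw body} for the response records in a warc.gz."""
    out = {}
    with gzip.open(path, 'rb') as f:
        for headers, body in iter_warc_records(f):
            if headers.get('WARC-Type') == 'response':
                out[headers['WARC-Target-URI']] = body
    return out
```

Two caveats: real warc.gz files usually store each record as its own gzip member, which Python's gzip module reads transparently as one concatenated stream; and response bodies still begin with the captured HTTP headers, so a full extractor would strip those before writing files to disk.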
We don't actually use/buy/license ABBYY
[16:19] We use LuraTech's command-line ocr tools, which happen to integrate the ABBYY SDK
[16:20] and (I don't think LuraTech has the abbyy 11 engine out yet, but if they did), upgrading all our OCR headless licenses would be very expensive, so it's likely to be forever a rainy day project
[16:21] I've played around with the open source OCR solutions
[16:21] OCR works fairly well, but putting the OCR into the PDFs doesn't.
[16:22] one issue I've noticed with the multilingual OCR in particular is that IA defaults everything to english and does the OCR as english, then if you edit the language later (which you have to do because the initial upload has no field for it) it doesn't re-ocr
[16:22] unless you manually re-derive *
[16:22] DFJustin: Yeah, it's rather annoying
[16:22] new uploader fixes that
[16:22] archive.org/upload has a lang field
[16:23] ah
[16:23] balrog_: What were you using to put the OCR into PDFs? hOCR output with the appropriate tool worked okay the last time I tried.
[16:23] I tried every hOCR tool I could find
[16:23] which did you use?
[16:23] PDFBeads.
[16:23] it might be worth doing an automated sweep for items where the OCRed language != the currently set language and rederiving
[16:23] Also, just making a note. We've been uploading 80-100GB of files through archive.org/upload with chrome.
[16:23] That's SO INSANE
[16:23] that one messes with the tiffs more than I'd like
[16:23] How so?
[16:24] I want the tiffs embedded in the PDFs *as I provide them*
[16:24] it breaks them into a foreground and a background
[16:26] ah, yeah
[16:26] luratech does that too
[16:26] otherwise the PDFs would be hundreds and hundreds of MB
[16:26] Hm, could have sworn that was optional, since you can feed preprocessed TIFFs into it.
[16:26] But yeah, if you don't do that, size gets crazy big.
[16:26] oh, it could be.
I'm not very familiar with the OCR workflow
[16:27] I do a ton of processing to the TIFFs ahead of time
[16:27] so I get the page sizes down to usually a few hundred kb per page
[16:28] Nemo_bis: ia701206 was hung on reboot, coming up now
[16:28] just fyi
[16:29] underscor: thanks
[16:30] balrog_: That's my workflow too - though typically I did the optimizing ahead of time and let pdfbeads handle jbig2 compression for size.
[16:30] Ugh, archive.org is getting hammered with SO MUCH SPAM
[16:30] :c
[16:31] You don't want to archive all the spam?
[16:31] http://archive.org/details/BuyFluconazole_269
[16:31] it's not even good spam
[16:31] I like the good spam
[16:33] :( True. I like some of the really weird stuff that ends up on the wiki, but then I'm not the one cleaning it up.
[16:38] so, jabber/xmpp, can i use the same account simultaneously on two machines?
[16:40] I *think* so
[16:40] I mean, I'm logged into my google talk account in like 4 places
[16:40] so
[16:40] yeah some of them upload ad videos into community video which is kinda nice, self-archiving
[16:41] the remote-linked pill ads are just lame though
[16:41] data de-dup is needed for archive.org to handle the ads archive
[16:41] I find the review spam more annoying though
[16:42] at least that is not uploading stuff
[16:42] ->bs
[17:09] found a full call for help episode from 2001
[17:10] it may be the full hour with ads in it
[17:11] which is good cause we can then save some techtv or zdtv ads
[21:52] http://archive.org/details/musopen-dvd just went live.
[21:52] Enjoy
[22:56] this project is something I think people in here would like: http://www.indiegogo.com/avgeeks100miles?c=home
[23:11] ooh, cool
[23:11] I wish they were using a mueller dataframescanner instead of a telecine, but meh
[23:57] I wish they were using something like the mueller dataframescanner but with much higher resolution
[23:57] Coderjoe: this is for what?
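On the hOCR discussion above (around 16:21-16:30): an hOCR-to-PDF tool such as PDFBeads first pulls word text and bounding boxes out of the OCR engine's hOCR output, then paints an invisible text layer over the page image at those coordinates. That first step can be sketched with the standard library alone; this is a simplified illustration assuming tesseract-style `ocrx_word` spans, not PDFBeads' actual parser:

```python
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (word, (x0, y0, x1, y1)) pairs from an hOCR document.
    Sketch only: assumes words are marked up as
    <span class='ocrx_word' title='bbox x0 y0 x1 y1; ...'>word</span>."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the ocrx_word span we're inside, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if 'ocrx_word' in a.get('class', ''):
            # title looks like: "bbox 100 100 200 150; x_wconf 96"
            for part in a.get('title', '').split(';'):
                fields = part.split()
                if fields and fields[0] == 'bbox':
                    self._bbox = tuple(int(v) for v in fields[1:5])

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None
```

A PDF writer would then scale each bbox from image pixels to PDF points and place the word there in invisible render mode, which is where the complaints above about tools re-encoding the TIFFs come in: the text layer and the page image have to be generated together.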