#archiveteam 2012-08-10,Fri

Time Nickname Message
00:16 tef alard: i've been given the green light to push proxy code into warctools.
00:39 mistym shaqfu: Probably one of those things that disappeared with Versiontracker or whatever :/
00:40 shaqfu mistym: Hopefully not. Going to wait until Sunday (2 wks since I emailed) and try his work address
00:41 shaqfu Since I'm not aware of any other checksum software that'll gracefully handle MacOS files
00:44 mistym I guess you already looked on cpan...?
00:47 shaqfu Hm, didn't think to. Trying now
00:48 shaqfu Nope
00:49 mistym Darn. Would have been nice to have been that easy!
00:50 mistym Dealing with Mac files can be funny. Once dealt with an open source project that only supported using a set of Mac resource files if you obtained them on a non-Mac platform. If you had the Mac files on a Mac computer you were out of luck.
00:50 shaqfu Weird - I guess it couldn't handle the forks?
00:51 mistym Yeah, that was the problem. Only way it handled the forks was in a special MacBinary file format which was invented by a Windows tool in order to keep the forks together on non-Mac filesystems.
00:52 shaqfu mistym: And now you know why I want something that handles forks natively :)
00:52 mistym Exactly ;)
00:58 illunatic http://blog.greenpirate.org/hackitat-a-film-about-political-hacking/
00:58 illunatic this oughter be a good'n
01:40 shaqfu Oh, neat - might have a shot at "20-30 boxes worth" of old computer mags
01:40 shaqfu ASking for titles/ranges now
07:43 Nemo_bis hm I think I've exhausted ia601206's disk
07:43 Nemo_bis http://www.us.archive.org/log_show.php?task_id=117980926 http://www.us.archive.org/log_show.php?task_id=117982231 http://www.us.archive.org/log_show.php?task_id=117985215 http://www.us.archive.org/log_show.php?task_id=117988133
07:45 Nemo_bis no, must be something else
07:50 Coderjoe waiting for kisk fix
07:50 Coderjoe disk
07:54 Coderjoe mmm
07:54 Coderjoe not too often you see /dev/sdal1 (SDAL1)
07:56 underscor Nemo_bis: ia701206 is down, which is why that happened
07:57 Nemo_bis ok
08:03 underscor Nemo_bis: you broke it! http://i.imgur.com/7Bs6Z.png
08:03 underscor :D
08:03 underscor it'll probably be down until tomorrow unless one of the ops guys happens to wake up and see the message
08:05 emijrp imagine all the IA boxes with messages like that, but black background and saying: DATA LOST
08:05 underscor haha
08:06 underscor and this playing in the background http://www.freesound.org/people/murcielago123/sounds/81459/
08:07 underscor even better
08:07 underscor http://www.freesound.org/people/jleedent/sounds/82392/
08:08 emijrp that is the music + ia members shouting and running
08:08 underscor http://www.freesound.org/people/DaveCarter/sounds/109139/
08:30 kennethre SketchCow: your follow up is fucking awesome
08:31 SketchCow I even shaved
08:32 ersi :-O!
08:45 SmileyG hmmm
08:45 SmileyG love it when one of our devs randomly decides to reboot a server
08:45 SmileyG and billions of nagios warnings go off
08:45 SmileyG people going "woaaah???"
08:53 BlueMax follow up?
08:54 BlueMax oh wow he actually put up the KS
09:10 SketchCow http://www.bytecellar.com/photo_pano.html
09:20 Coderjoe i should try and see if my atari 520st still works
09:21 Coderjoe my at&t pc6300 is long dead and shot up
10:40 C-Keen I just have to admit that I always got something out of watching data getting transferred, watching data flow into the archive brings back memories from modem days...
12:31 balrog_ hmm
12:31 balrog_ supposedly linux 0.02 through 0.10 source code has been lost
12:31 balrog_ I wonder if it's hiding someplace
12:34 emijrp balrog_: i dont think so http://archive.org/details/git-history-of-linux
12:37 balrog_ wow, http://torrentfreak.com/demonoid-raid-credited-to-ifpi-multiple-arrests-in-mexico-reported-120809/
12:39 Schbirid fuck torrentfreak, they are such a bad piece of hyping tabloid rubbish money making poop machine
12:39 Schbirid http://www.ifpi.org/content/section_news/20120809.html
12:39 Schbirid rarely linking to their sources
12:40 Schbirid *proud to have reject their "offer" to write for them years ago*
12:40 Schbirid gawker should buy them
12:40 Schbirid rant over
12:40 emijrp yahoo
12:40 Schbirid haha, yes
12:40 Schbirid that would be a great day. yahoo buying gawker and huffpo
12:42 balrog_ lol, why?
12:45 Schbirid because yahoo kills things
12:47 ersi Use #archiveteam-bs dudes
12:53 godane i may do a warc.gz of ifpi.org now
12:53 godane since a lot of pdfs are on there
13:01 godane part me wonders how big ifpi.org is to download
13:01 Schbirid i will not visit you in guantanamo!
13:02 godane cause i'm mirroring there site?
13:05 godane looks like wayback magazine has spotly archives of ifpi.org
13:06 godane 2005 seams to have been the year it was archived the most
14:40 Nemo_bis I wonder if someone will answer: http://archive.org/post/426995/finereader-11
14:43 emijrp i think that the best OCR solution for IA books is software like this http://beta.fromthepage.com/display/display_page?ol=w_rw_p_pl&page_id=378
14:43 emijrp crowdsourcing OCR
14:44 Schbirid yeah
14:49 Nemo_bis it's a horrible waste of time if the starting OCR is not good
14:49 godane you do archive the non-ocr files right?
14:49 godane just thinking that way you can re-ocr the files again
14:50 emijrp another idea, Internet Archive reCAPTCHA
14:51 Nemo_bis emijrp: https://www.mediawiki.org/wiki/Requests_for_comment/CAPTCHA
14:52 emijrp have you traveled to any wikimaniaÇ?
14:52 Schbirid heh, those Image CAPTCHAs are broken already
14:54 Nemo_bis emijrp: a couple
14:55 godane uploaded: http://archive.org/details/ifpi.org-20120810-mirror
15:00 godane uploading more gbtv while download techcrunch
16:01 SketchCow OK, small challenge.
16:02 SketchCow Small.
16:02 SketchCow http://fos.textfiles.com/web.me.com-shindelltravels.warc.gz
16:02 SketchCow give me the files inside it
16:02 SketchCow I'm fucking sick of fucking with .warc.gz
16:03 godane SketchCow: thats why i always upload a .tar.gz of my mirror dumps
16:03 godane and warc.gz for the archive waybackmagazine
16:04 SketchCow Thanks.
16:04 SketchCow You have given me the opposite of what I asked for.
16:04 SketchCow I asked for the files within it
16:05 SketchCow Not a story of how it could have been done differently
16:05 godane i get a 404 error from your link
16:05 Deewiant SketchCow: http://warctozip.herokuapp.com/
16:12 SketchCow Thank youuuuuu, Deewiant
16:18 underscor Nemo_bis: It's not likely to happen soon. We don't actually use/buy/license ABBYY
16:19 underscor We use LuraTech's command-line ocr tools, which happen to integrate the ABBYY SDK
16:20 underscor and (I don't think LuraTech has the abbyy 11 engine out yet, but if they did), upgrading all our OCR headless licenses would be very expensive, so it's likely to be forever a rainy day project
16:21 balrog_ I've played around with the open source OCR solutions
16:21 balrog_ OCR works fairly well, but putting the OCR into the PDFs doesn't.
16:22 DFJustin one issue I've noticed with the multilingual OCR in particular is that IA defaults everything to english and does the OCR as english, then if you edit the language later (which you have to do because the initial upload has no field for it) it doesn't re-ocr
16:22 DFJustin unless you manually re-derive *
16:22 underscor DFJustin: Yeah, it's rather annoying
16:22 underscor new uploader fixes that
16:22 underscor archive.org/upload has a lang field
16:23 DFJustin ah
16:23 mistym balrog_: What were you using to put the OCR into PDFs? hOCR output with the appropriate tool worked okay the last time I tried.
16:23 balrog_ I tried every hOCR tool I could find
16:23 balrog_ which did you use?
16:23 mistym PDFBeads.
16:23 DFJustin it might be worth doing an automated sweep for items where the OCRed language != the currently set language and rederiving
16:23 underscor Also, just making a note. We've been uploading 80-100GB of files through archive.org/upload with chrome.
16:23 underscor That's SO INSANE
16:23 balrog_ that one messes with the tiffs more than I'd like
16:23 mistym How so?
16:24 balrog_ I want the tiffs embedded in the PDFs *as I provide them*
16:24 balrog_ it breaks them into a foreground and a background
16:26 underscor ah, yeah
16:26 underscor luratech does that too
16:26 underscor otherwise the PDFs would be hundreds and hundreds of MB
16:26 mistym Hm, could have sworn that was optional, since you can feed preprocessed TIFFs into it.
16:26 mistym But yeah, if you don't do that, size gets crazy big.
16:26 underscor oh, it could be. I'm not very familiar with the OCR workflow
16:27 balrog_ I do a ton of processing to the TIFFs ahead of time
16:27 balrog_ so I get the page sizes down to usually a few hundred kb per page
16:28 underscor Nemo_bis: ia701206 was hung on reboot, coming up now
16:28 underscor just fyi
16:29 Nemo_bis underscor: thanks
16:30 mistym balrog_: That's my workflow too - though typically I did the optimizing ahead of time and let pdfbeads handle jbig2 compression for size.
16:30 underscor Ugh, archive.org is getting hammered with SO MUCH SPAM
16:30 underscor :c
16:31 mistym You don't want to archive all the spam?
16:31 underscor http://archive.org/details/BuyFluconazole_269
16:31 underscor it's not even good spam
16:31 underscor I like the good spam
16:33 mistym :( True. I like some of the really weird stuff that ends up on the wiki, but then I'm not the one cleaning it up.
16:38 Schbirid so, jabber/xmpp, can i use the same account simultanously on two machines?
16:40 underscor I *think* so
16:40 underscor I mean, I'm logged into my google talk account in like 4 places
16:40 underscor so
16:40 DFJustin yeah some of them upload ad videos into community video which is kinda nice, self-archiving
16:41 DFJustin the remote-linked pill ads are just lame though
16:41 godane data de-dup is needed for archive.org to handle the ads archive
16:41 DFJustin I find the review spam more annoying though
16:42 godane at least that is not uploading stuff
16:42 DFJustin ->bs
17:09 godane found a full call for help episode from 2001
17:10 godane it maybe the full hour with ads in it
17:11 godane which is good cause we can then save some techtv or zdtv ads
21:52 SketchCow http://archive.org/details/musopen-dvd just went live.
21:52 SketchCow Enjoy
22:56 dashcloud this project is something I think people in here would like: http://www.indiegogo.com/avgeeks100miles?c=home
23:11 underscor ooh, cool
23:11 underscor I wish they were using a mueller dataframscanner instead of a telecine, but meh
23:57 Coderjoe I wish they were using something like the mueller dataframescanner but with much higher resolution
23:57 balrog_ Coderjoe: this is for what?