[00:12] anyone backed up any of what.cd?
[00:13] esp. collages
[00:15] yes
[00:22] ^
[01:20] Archiving some CD-ROMs using imgburn. Can anyone confirm whether imgburn will create a .cue/.bin automatically if the disc is dual mode (mode 1 and mode 2, i.e. data + audio)? So far all the discs are ripping as .iso. I just want to make sure a dual mode disc will be archived correctly and that I'm using the right tool.
[03:06] turnkit: yes, imgburn will automatically switch to bin/cue for anything that's not vanilla mode 1
[03:07] what kind of stuff are you archiving
[03:29] I hope it's awesome stuff
[03:29] DFJustin: You're able to fling stuff into cdbbsarchive, I see. Good.
[03:30] have been for the better part of the last year :P
[03:30] I figured, just figured I'd say it.
[03:32] I'd appreciate a mass bchunk run when you get a chance, would be more efficient than converting the stuff locally and uploading it over cable
[03:33] Explain bchunk run here (I'm tired, drove through 5 states)
[03:33] What did bchunk do again
[03:34] convert .bin/.cue into browseable .iso
[03:34] Oh that's right.
[03:34] I can do that.
[03:34] Right now, however, I'm working on a bitsavers ingestor
[03:34] >:D
[03:34] If I get that working, boom, 25,000 documents
[03:35] al's been flinging dumps of one obscure-ass system after another towards messdev
[03:35] One little thing, though.
[03:35] https://archive.org/details/cdrom-descent-dimensions
[03:36] Try to avoid going "this was sued out of existence, haw haw" in a description.
[03:36] hah yeah I guess so, that was a copy-paste from elsewhere
[03:36] And yes, Al's been on a hell of a run.
[03:36] To be honest, bitsavers is terrifying at how much it's bringing in now.
[03:38] wc -l IndexByDate.txt 25872 IndexByDate.txt
[03:38] So 25,872 individual PDFs
[03:40] SketchCow: i'm uploading my official xbox magazine pdfs
[03:40] Yes.
[03:40] You spelled Magazine wrong.
[03:40] it's rather exciting to watch
[03:41] Also, they're almost guaranteed to disappear - Future Publishing is a strong, modern entity.
[03:42] yes, but this stuff is given away for free
[03:44] http://www.oxmonline.com/secretstash
[03:45] Hmm, I may be wrong.
[03:45] Worth seeing.
[03:48] i'm also fixing the typo
[03:50] So the good news is that, by mistake, I uploaded a script to archive.org with my s3 keys in it.
[03:50] I immediately deleted it when I saw what the script had done, of course.
[03:50] But therefore, I HAD to smash my s3 keys
[03:50] Therefore I HAVE to smash all my scripts that use those keys.
[03:51] Therefore, I HAVE NO REASON not to make them all access an external file for their s3 key information.
[03:51] AND FINALLY THEREFORE, that means as I finish these new scripts, I can drop them on people via github or whatever
[03:54] And then I'm going after that mkepunk.com site, because holy shit
[05:26] oooh
[05:59] DFJustin: MacAddict set. I collected about 70% of the set but only maybe 50% of the sleeves. Unsure when sleeves stopped and restarted. I've got a lot of silkscreen (cd) and sleeve scanning left but am uploading slowly now. Need to find a source for the missing discs still. Am trying to track them down.
[06:01] DFJustin: thx for the imgburn confirmation. Puts my mind at ease.
[06:17] It's written!
[06:17] I just started doing it.
[06:17] I can do masses of specific machine uploads
[06:18] And as you can see, it refuses to upload twice.
[06:18] [!] beehive/118010M_A_Family_of_Video_Display_Terminals_for_OEM_Planning_Brochure.pdf already uploaded!
[06:18] [!] beehive/BM1912RAM08-A_Model_DM20_Data_Sheet_March1980.pdf already uploaded!
[06:18] [!] beehive/BM1912RAM08-B_Model_DM1A_Data_Sheet_March1980.pdf already uploaded!
[06:18] [!] beehive/BM1912RAM08-C_Model_DM30_Data_Sheet_March1980.pdf already uploaded!
[06:18] [!] beehive/BM1912RAM08-D_Model_DM10_Data_Sheet_March1980.pdf already uploaded!
[06:18] [!] beehive/BM1912RAM08-E_Model_DM_S_Data_Sheet_March1980.pdf already uploaded!
[06:18] [!] beehive/BM1913MAR80-Micro_4400_Brochure_March1980.pdf already uploaded!
[06:18] [!] beehive/Beehive_Service_Field_Engineering_Brochure.pdf already uploaded!
[06:18] [root@www /usr/home/jason/BITARCHIVERS]# for each in `grep beehive *Date* | cut -f3- -d" "`;do ./bitsaver "$each";done
[06:19] Does that upload to Archive.org?
[06:20] Yes, that's what this is.
[06:22] http://archive.org/details/bitsavers_beehive118isplayTerminalsforOEMPlanningBrochure_9464917
[06:22] for example
[06:38] bits: fukken saved
[06:40] yeah
[06:40] Well, more that they're now up in the archive.org collection.
[06:40] Now, remember the big second part to my doing this - pulling the item information into a wiki so people can edit it and we can sync it back to archive.org
[06:41] That'll be the big thing
[06:41] Then we can set it up for other collections.
[06:41] total hack around internet archive not buying into collaborative metadata
[06:42] http://archive.org/details/bitsavers_motorola68_1675238 is what we have for now
[06:49] https://archive.org/details/bitsavers_atari40080mTextEditor1981_3442791
[07:02] love that updating solutions are being addressed.
[07:02] has to be.
[07:02] this is the year!
[07:02] I am going to fix some shit. I know how things work now, and we're going to get cruising.
[07:05] fix the shit out of it this year
[07:07] Yeah
[07:07] And among that is fixing the metadata thing.
[07:07] So I'll need help with that, as we make a second wiki for pulling in collections to work on
[07:22] x-archive-meta-title:ibm :: 1130 :: subroutines :: 00.3.003 Magnetic Tape Subroutines For Assembler and Fortran Compiled Programs for the IBM 1130
[07:22] Drool
[07:23] ooo
[07:23] now you're talking
[07:24] You guys saved all those old documents for atari?
[07:25] wow
[07:39] is trs-80 microcomputer news uploaded to archive.org?
[07:42] i only ask because it looks like romsheperd has it
[08:41] Nice, SketchCow is keeping the OCR boxes busy. :)
[08:54] all of the free xbox magazines are uploaded now
[08:54] good work
[09:04] godane: come on, only 835 texts uploads? You can do better. ;-) https://archive.org/search.php?query=uploader%3Aslaxemulator%20AND%20mediatype%3Atexts
[09:05] why is it that some of my website dumps are in texts?
[09:05] i know i uploaded them that way, but jason moved them to archiveteam-file
[09:06] i do notice they don't have the archiveteam-file web interface
[09:07] SketchCow: i think you need to change the mediatype on some of my webdumps and iso files
[09:08] you only put them into the collection
[09:08] without the mediatype change they don't get the collection web interface
[09:10] SketchCow: you also missed one of my groklaw.net pdf dumps: https://archive.org/details/groklaw.net-pdfs-2007-20120827
[09:10] even though it's pdfs, they're in a warc.gz and tar.gz
[09:11] this one also needs its mediatype changed: https://archive.org/details/TechTV_Computer_Basics_with_Chris_Pirillo_and_Kate_Botello
[14:12] Does anyone here use warc-proxy on Linux or OS X?
[14:54] hiker1: I use Linux.
[14:55] ah. I was just wondering how portable it was. Apparently very.
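The uploader described earlier in this stretch of the log combines three things: the archive.org S3-style API with its x-archive-meta-* headers (the [07:22] line), keeping the s3 keys in an external file rather than inside the script ([03:51]), and refusing to upload a file the item already has ([06:18]). Below is a minimal Python sketch of that pattern; it is not the actual ./bitsaver script, and the key-file format, item identifier, local path and metadata values are all invented for illustration.

    # Hedged sketch: upload one file to an archive.org item via the S3-style
    # API, skipping it if the item already lists that file.  Everything named
    # here (key file, item, path, metadata values) is a placeholder.
    import json
    import os
    import requests

    # Hypothetical external key file, e.g. {"access": "...", "secret": "..."}
    with open(os.path.expanduser("~/.ia_s3_keys.json")) as fh:
        keys = json.load(fh)

    item = "bitsavers_example_item"          # placeholder identifier
    path = "beehive/example_brochure.pdf"    # placeholder local file
    name = os.path.basename(path)

    # archive.org's metadata endpoint lists the files an item already contains.
    existing = requests.get("https://archive.org/metadata/" + item).json()
    already = {f["name"] for f in existing.get("files", [])}

    if name in already:
        print("[!] %s already uploaded!" % path)
    else:
        headers = {
            "authorization": "LOW %s:%s" % (keys["access"], keys["secret"]),
            "x-archive-auto-make-bucket": "1",
            "x-archive-meta-mediatype": "texts",
            "x-archive-meta-collection": "bitsavers",   # illustrative value
            "x-archive-meta-title": "Example Beehive brochure",
        }
        with open(path, "rb") as fh:
            resp = requests.put("https://s3.us.archive.org/%s/%s" % (item, name),
                                data=fh, headers=headers)
        resp.raise_for_status()

The metadata-endpoint check is just one way to get the "already uploaded!" behaviour shown above; the real script may track its state differently.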
[14:55] You're using it on Windows? I find that even more interesting.
[14:56] heh. well, it works here too, with no problems
[14:59] Have you used the Firefox extension as well? Apparently OS X is (or was, perhaps) more difficult: https://github.com/alard/warc-proxy/issues/1
[14:59] No, I didn't try it. My Firefox is so slow already ;_;
[14:59] plus I haven't restarted it in god knows how long
[15:00] That is an old ticket!
[15:05] alard: It probably doesn't like having the two extensions
[15:05] This was before the two extensions. At that time WARC *.warc.gz was the only file type (and there wasn't an All files option).
[15:06] I mean .warc.gz
[15:06] instead of .gz
[15:06] Ah, I see.
[15:06] That would be my guess
[15:07] Probably not much is lost by changing it to just .gz.
[15:07] But does the 'filename extension' even exist in OS X? I thought that was a Windows thing.
[15:07] I don't know
[15:08] hmmm
[15:08] is it like the linux version?
[15:09] If it exists, it obeys it (or complains about it), and if it doesn't exist, it doesn't care?
[15:10] Could this have something to do with it? https://bugzilla.mozilla.org/show_bug.cgi?id=444423
[15:14] it might. would have to see the Firefox code for nsIFilePicker
[15:15] well, for the OS X code. that is just an interface
[15:20] That bug is probably related. I guess they'll be fixing it any time now. Yep, any year now. Maybe before the decade is out? Well, two decades?
[15:28] uploaded: http://archive.org/details/Call.For.Help.Canada.2004.08.17
[15:29] SketchCow: there is going to be a call for help canada collection soon in computers and tech videos
[15:29] i also plan on doing the same thing for all of the the screen savers episodes i have
[15:59] What is call for help canada?
[16:00] tech tv show
[16:00] godane: How are you going to create the collection?
[16:00] i'm not creating the collection
[16:01] but jason scott puts my files into a collection
[16:01] ah
[16:13] He lights a candle and I am there
[16:17] the screen savers have like the last 8 months of
[16:17] IIPC is working to create an archiving proxy: http://netpreserve.org/projects/live-archiving-http-proxy
[16:18] i'm thinking of something crazy that could in theory save space
[16:18] ?
[16:18] like merging multiple warc.gz of the same site into sort of a megawarc
[16:19] won't help
[16:19] i was thinking of a way to dedup multiple warc.gz
[16:19] all you need to do is use actual compression instead of warc.gz files.
[16:20] warc.gz is not a true compressed file. It is a bunch of compressed files merged together. Each HTML file is compressed by itself, without knowledge of the other html files.
[16:20] SketchCow: Semantic MediaWiki + Semantic Forms might be something to look at for your metadata wiki.
[16:20] this way something like my 5 torrentfreak warc.gz can be in one megawarc but a lot smaller
[16:20] This severely hurts compression
[16:20] You can make structured forms like these: http://hackerspaces.org/w/index.php?title=The_1st_SPACE_Retreat&action=formedit
[16:21] godane: Just extract a .warc.gz file, then compress it with .7z. Alternatively, extract a warc file and compress it with 7-zip along with another .warc file
[16:22] Is compression that important?
[16:22] hiker1: i'm thinking of something that can still work with the wayback machine
[16:22] alard: Yes... especially for transferring files to IA.
[16:22] and it saves bandwidth for people downloading from IA
[16:22] But you're making your warc files much, much less useful.
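The "not a true compressed file" point above is easy to verify: a .warc.gz is a series of small gzip members, roughly one per record, so recompressing the whole WARC as a single solid stream usually comes out noticeably smaller. A minimal sketch, assuming a placeholder file name:

    # Compare per-record gzip (.warc.gz) with solid compression of the same
    # WARC.  "example.warc.gz" is a placeholder, not a file from this log.
    import gzip
    import lzma
    import os

    path = "example.warc.gz"

    # Python's gzip module transparently reads concatenated gzip members,
    # so this yields the full uncompressed WARC.
    with gzip.open(path, "rb") as f:
        warc_bytes = f.read()

    solid = lzma.compress(warc_bytes, preset=6)

    print("per-record gzip:", os.path.getsize(path), "bytes")
    print("solid xz:       ", len(solid), "bytes")

The catch, as noted right above, is that the per-record gzip layout is what lets the wayback machine and other readers seek straight to an individual record, so a solidly compressed copy is a transport/storage format rather than a replacement.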
[16:23] i'm also thinking of some way IA could have a kit to set up archive.org at home, of sorts
[16:23] dedup matters when you think of it that way
[16:23] Does IA even use the original warc.gz file in production? I assume they use it to feed the machine, but then they could have just extracted a 7z and fed the machine with that
[16:24] I don't know. The wayback machine that you can download from GitHub certainly reads from .warc.gz files.
[16:25] but I'm guessing IA has two copies
[16:25] one as the Item, and one for the machine
[16:25] I am not sure though, it's just a guess
[16:28] For a continuous crawler, you could save space by checking old records in a WARC file and then adding revisit records as necessary. But this would not work with any current WARC readers (wayback machine, warc-proxy, etc.)
[16:29] continuous crawler being e.g. one that crawls every week to check for changes to the site
[16:31] godane: If you have the WARC ISO draft, it discusses this type of example on p. 27
[16:39] hiker1: again, i was thinking of a way to dedup multiple warcs into one big warc
[16:39] not revisit records in another file
[16:41] the idea is to store the file once
[16:41] godane: If you had two snapshots, warc1 taken first, then warc2, you could run through warc2 and see if the HTTP body matches the record in warc1. If it matches, append a revisit record to warc1. If they are different, append the new record to warc1.
[16:42] godane: Two snapshots of the same site, right?
[16:42] or do you mean two warc files with different contents, e.g. html in one and images in another?
[16:43] i mean content with the same md5sum or checksum
[16:44] Yes, as I said, run through the two and check if the http body is the same
[16:44] WARC files already offer the payload-digest to check if the http body is the same.
[16:44] again, i want the two to merge into one file
[16:45] also say that the site is dead
[16:45] that would merge them effectively
[16:45] after you did it, you could delete warc2.
[16:45] since all the contents would be in warc1
[17:58] Just had a fascinating conversation with brewster about internet archive.
[17:58] Talking about fundraising, still on his mind, how that can be done better next year.
[18:00] one of the things I mentioned was Internet Archive taking over or bringing back certain "services" people have expected over the years.
[18:00] So have a bank of virtual servers that are basically "this service"
[18:00] brainstorming on that would be good at some point.
[18:00] So basically, he's up for non-archive.org-interface things
[18:02] Also, I am fucking DESTROYING the submit queue
[18:02] \o/
[18:05] * Nemo_bis feels the derivers gasping
[18:05] * Nemo_bis laughs devilishly
[18:14] I can already see a few dozen, maybe a couple hundred, will come out 'wrong'.
[18:17] SketchCow: my meccano-magazine-* derives always failed when rerun all together. I had to rerun only ~100 at a time to avoid them timing out on solr_post.py or getting stuck with high load on the OCR hosts.
[18:18] Right.
[18:18] We should make those a collection, huh.
[18:18] I mean, it IS 650 issues
[18:21] http://archive.org/details/meccano_magazine
[18:22] SketchCow, do you have an example service re: your talk with brewster?
[18:25] ah... maybe i see what you're getting at. eg: provide gopher archives through an actual gopher server instead of all webbed up in the archive.org interface.
[18:31] http://archive.org/details/meccano_magazine is now coming along nicely.
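The two-snapshot comparison sketched at [16:41]-[16:45] can be done entirely from the WARC-Payload-Digest headers, without hashing anything yourself. A minimal sketch follows; it uses the third-party warcio library as a convenient reader (not something discussed in this log) and only reports which records in the second snapshot duplicate a payload from the first, i.e. the candidates for revisit records. The file names are placeholders, and as noted above, most replay tools of the time would not understand the resulting revisit records.

    # Find records in snapshot2 whose payload digest already appears in
    # snapshot1 -- the candidates to replace with 'revisit' records.
    # snapshot1.warc.gz / snapshot2.warc.gz are placeholder file names.
    from warcio.archiveiterator import ArchiveIterator

    def payload_digests(path):
        """Map WARC-Payload-Digest -> first URL seen with that digest."""
        digests = {}
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                digest = record.rec_headers.get_header("WARC-Payload-Digest")
                uri = record.rec_headers.get_header("WARC-Target-URI")
                if digest and digest not in digests:
                    digests[digest] = uri
        return digests

    old = payload_digests("snapshot1.warc.gz")

    with open("snapshot2.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            digest = record.rec_headers.get_header("WARC-Payload-Digest")
            if digest in old:
                # Same payload as the older snapshot: a revisit record could
                # point back at old[digest] instead of storing the body again.
                print("duplicate:", record.rec_headers.get_header("WARC-Target-URI"))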
[18:31] Yes
[18:31] That is exactly what I mean
[18:31] "stuff"
[19:03] How is chunked HTTP encoding supposed to be handled in a WARC file?
[19:03] Should I just remove the chunked header from the response?
[19:18] alard: warc-proxy passes the transfer-encoding header. This seems to leave the connection open forever.
[19:19] for responses that have it set
[19:28] I think I might be saving my chunks wrong.
[19:40] No, I think I am saving them right. Hanzo warctools handles decoding the chunks, so I don't think warc-proxy should pass the transfer-encoding header, since that would be telling the browser to handle the chunks.
[19:50] That Adobe CS2 stuff can be found here: http://noelchenier.blogspot.ca/2013/01/download-adobe-cs2-for-free.html
[20:00] Thanks.
[20:00] Grabbing.
[20:05] Up past 1,500 red rows on archive.org!
[20:05] Deriver is dying on me
[20:05] TAKE IT
[20:05] TAKE ALL OF IT
[20:06] MOAR FATA!!!!
[20:15] I had to stop it.
[20:15] It's at 1,480.
[20:16] I'll let things die down and do another massive submit after these get done
[20:16] Or it'll murder a dev
[20:45] hiker1: Are you certain you have the latest warc-proxy? The latest version shouldn't send the Transfer-Encoding header: https://github.com/alard/warc-proxy/blob/master/warcproxy.py#L263
[21:09] SketchCow: I really miss having a ganglia graph for all servers to see the CPU load.
[21:12] SketchCow: some meccano-magazine-* items were not added to the collection (like half of them?), in case you don't know
[21:13] eg https://archive.org/catalog.php?history=1&identifier=meccano-magazine-1966-03
[21:29] i got one of my smart computing magazines in the mail
[21:29] there is some highlighting and writing on the front
[21:29] but that can be fixed in gimp
[21:31] omg godane's scans are RETOUCHED
[21:34] the original will be posted too
[21:34] j/k
[21:34] the cover mostly had the writing on the front
[21:34] where it's white
[21:36] omg: http://www.ebay.com/sch/Magazine-Back-Issues-/280/i.html?_ipg=25&_from=&_nkw=&_armrs=1&_ssn=treasures-again
[21:37] who is willing to give me money for archiving this?
[21:38] just know that all these items have about 3 days or less
[21:40] is there any point in paying full price for this stuff on ebay? you can probably get stacks at your local salvation army for 25c a pop
[21:42] i'm in new england, i don't know if the local salvation army will have this stuff
[21:44] your local university library probably has them then
[21:46] at least you should check places before paying $7 an issue for stuff that's not super old and rare
[21:46] DFJustin: i don't drive
[21:47] my brother is the one that drives
[21:47] there's also surely someone with a cellar full of thousands of magazines, if one can drive to them and collect them
[21:47] but you don't, too bad
[22:26] http://www.forbes.com/sites/adriankingsleyhughes/2013/01/07/download-adobe-cs2-applications-for-free/
[22:29] * not actually free
[22:35] lol
[23:06] http://windowsphone.bboard.de/board/
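On the chunked-encoding question from [19:03]-[19:40] (and the warc-proxy fix referenced at [20:45]): the WARC stores the response as it came off the wire, but once the reader has decoded the chunked body, the replay side has to stop advertising Transfer-Encoding, or the browser waits forever for chunk framing that never arrives. A small illustrative sketch of that header cleanup; it is not warc-proxy's actual code, and the function name is made up.

    # Prepare headers for serving a body that has already been de-chunked by
    # the WARC reader: drop the stale framing headers and send an accurate
    # Content-Length instead.  Illustrative only, not warc-proxy's code.
    def replay_headers(original_headers, decoded_body):
        cleaned = []
        for name, value in original_headers:
            # These no longer describe the body we are about to send.
            if name.lower() in ("transfer-encoding", "content-length", "connection"):
                continue
            cleaned.append((name, value))
        cleaned.append(("Content-Length", str(len(decoded_body))))
        cleaned.append(("Connection", "close"))
        return cleaned

    # Example:
    hdrs = [("Content-Type", "text/html"), ("Transfer-Encoding", "chunked")]
    body = b"<html>replayed page</html>"
    print(replay_headers(hdrs, body))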