#archiveteam 2013-01-07,Mon

Time	Nickname	Message
00:12 ^🔗	ivan`	anyone backed up any of what.cd?
00:13 ^🔗	ivan`	esp. collages
00:15 ^🔗	DFJustin	yes
00:22 ^🔗	Aranje	^
01:20 ^🔗	turnkit	Archiving some CD-ROMs using imgburn. Can anyone confirm if imgburn will create a .cue/.bin automatically if the disc is dual mode (mode 1 and mode 2 -- i.e. data + audio)? So far all the discs are ripping as .iso. I just want to make sure a dual mode disc will be archived correctly and that I'm using the right tool.
03:06 ^🔗	DFJustin	turnkit: yes imgburn will automatically switch to bin/cue for anything that's not vanilla mode1
03:07 ^🔗	DFJustin	what kind of stuff are you archiving
03:29 ^🔗	SketchCow	I hope it's awesome stuff
03:29 ^🔗	SketchCow	DFJustin: You're able to fling stuff into cdbbsarchive, I see. Good.
03:30 ^🔗	DFJustin	have been for the better part of the last year :P
03:30 ^🔗	SketchCow	I figured, just figured I'd say it.
03:32 ^🔗	DFJustin	I'd appreciate a mass bchunk run when you get a chance, would be more efficient than converting the stuff locally and uploading it over cable
03:33 ^🔗	SketchCow	Explain bchunk run here (I'm tired, drove through 5 states)
03:33 ^🔗	SketchCow	What did bchunk do again
03:34 ^🔗	DFJustin	convert .bin/.cue into browseable .iso
03:34 ^🔗	SketchCow	Oh that's right.
03:34 ^🔗	SketchCow	I can do that.
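A rough sketch of what a mass bchunk run could look like, written in Python. It only builds the command lines; bchunk takes the .bin, the .cue, and an output basename, in that order. The helper name `bchunk_commands` and the assumption that every image is stored as NAME.bin next to NAME.cue are illustrative, not taken from the log.

```python
from pathlib import Path


def bchunk_commands(root):
    """Build one bchunk command per .cue file that has a matching .bin.

    Hypothetical helper: assumes each disc image is stored as NAME.bin
    next to NAME.cue somewhere under root.
    """
    commands = []
    for cue in sorted(Path(root).rglob("*.cue")):
        bin_file = cue.with_suffix(".bin")
        if not bin_file.exists():
            continue  # no matching .bin, nothing to convert
        # bchunk usage: bchunk <image.bin> <image.cue> <basename>
        # bchunk appends the track number and extension to the basename.
        basename = cue.with_suffix("")
        commands.append(["bchunk", str(bin_file), str(cue), str(basename)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine that has bchunk installed.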
03:34 ^🔗	SketchCow	Right now, however, I'm working on a bitsavers ingestor
03:34 ^🔗	DFJustin	>:D
03:34 ^🔗	SketchCow	If I get that working, boom, 25,000 documents
03:35 ^🔗	DFJustin	al's been flinging dumps of one obscure-ass system after another towards messdev
03:35 ^🔗	SketchCow	One little thing, though.
03:35 ^🔗	SketchCow	https://archive.org/details/cdrom-descent-dimensions
03:36 ^🔗	SketchCow	Try to avoid going "this was sued out of existence, haw haw" in a description.
03:36 ^🔗	DFJustin	hah yeah I guess so, that was a copy-paste from elsewhere
03:36 ^🔗	SketchCow	And yes, Al's been on a hell of a run.
03:36 ^🔗	SketchCow	To be honest, bitsavers is terrifying at how much it's bringing in now.
03:38 ^🔗	SketchCow	wc -l IndexByDate.txt 25872 IndexByDate.txt
03:38 ^🔗	SketchCow	So 25,872 individual PDFs
03:40 ^🔗	godane	SketchCow: i'm uploading my official xbox magazine pdfs
03:40 ^🔗	SketchCow	Yes.
03:40 ^🔗	SketchCow	You spelled Magazine wrong.
03:40 ^🔗	SketchCow	it's rather exciting to watch
03:41 ^🔗	SketchCow	Also, they're almost guaranteed to disappear - Future Publishing is a strong, modern entity.
03:42 ^🔗	godane	yes but this stuff is being given away for free
03:44 ^🔗	godane	http://www.oxmonline.com/secretstash
03:45 ^🔗	SketchCow	Hmm, I may be wrong.
03:45 ^🔗	SketchCow	Worth seeing.
03:48 ^🔗	godane	i'm also fixing the typo
03:50 ^🔗	SketchCow	So the good news is by mistake, I uploaded a script to archive.org with my s3 keys in it.
03:50 ^🔗	SketchCow	I immediately deleted it when I saw what the script had done, of course.
03:50 ^🔗	SketchCow	But therefore, I HAD to smash my s3 keys
03:50 ^🔗	SketchCow	Therefore I HAVE to smash all my scripts that use those keys.
03:51 ^🔗	SketchCow	Therefore, I HAVE NO REASON not to make them all access an external file for their s3 key information.
03:51 ^🔗	SketchCow	AND FINALLY THEREFORE, that means as I finish these new scripts, I can drop them on people for github or whatever
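A minimal sketch of the "external file for s3 key information" idea, so that scripts can be shared without keys inside them. The file layout, the section and option names, and the helper `load_s3_headers` are assumptions for illustration. The `LOW access:secret` header is how archive.org's S3-like API is commonly called, but check the current documentation before relying on it.

```python
import configparser


def load_s3_headers(path):
    """Read S3-style keys from an external file and build an auth header.

    Assumed (hypothetical) file layout:
        [s3]
        access = ACCESSKEY
        secret = SECRETKEY
    """
    config = configparser.ConfigParser()
    if not config.read(path):
        # read() returns the list of files it managed to parse
        raise FileNotFoundError(path)
    access = config["s3"]["access"]
    secret = config["s3"]["secret"]
    # Assumption: the upload endpoint expects "LOW access:secret".
    return {"authorization": f"LOW {access}:{secret}"}
```

The key file stays outside the script directory (and outside any repository), so the scripts themselves can go on GitHub as they are.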
03:54 ^🔗	SketchCow	And then I'm going after that mkepunk.com site because holy shit
05:26 ^🔗	Aranje	oooh
05:59 ^🔗	turnkit	DFJustin: MacAddict set. I collected about 70% of the set but only maybe 50% of the sleeves. Unsure when sleeves stopped and restarted. I've got a lot of silkscreen (cd) and sleeve scanning left but am uploading slowly now. Need to find a source for the missing discs still. Am trying to track them down.
06:01 ^🔗	turnkit	DFJustin: thx for the imgburn confirmation. Puts my mind at ease.
06:17 ^🔗	SketchCow	It's written!
06:17 ^🔗	SketchCow	I just started doing it.
06:17 ^🔗	SketchCow	I can do masses of specific machine uploads
06:18 ^🔗	SketchCow	And as you can see, it refuses to upload twice.
06:18 ^🔗	SketchCow	[!] beehive/118010M_A_Family_of_Video_Display_Terminals_for_OEM_Planning_Brochure.pdf already uploaded!
06:18 ^🔗	SketchCow	[!] beehive/BM1912RAM08-A_Model_DM20_Data_Sheet_March1980.pdf already uploaded!
06:18 ^🔗	SketchCow	[!] beehive/BM1912RAM08-B_Model_DM1A_Data_Sheet_March1980.pdf already uploaded!
06:18 ^🔗	SketchCow	[!] beehive/BM1912RAM08-C_Model_DM30_Data_Sheet_March1980.pdf already uploaded!
06:18 ^🔗	SketchCow	[!] beehive/BM1912RAM08-D_Model_DM10_Data_Sheet_March1980.pdf already uploaded!
06:18 ^🔗	SketchCow	[!] beehive/BM1912RAM08-E_Model_DM_S_Data_Sheet_March1980.pdf already uploaded!
06:18 ^🔗	SketchCow	[!] beehive/BM1913MAR80-Micro_4400_Brochure_March1980.pdf already uploaded!
06:18 ^🔗	SketchCow	[!] beehive/Beehive_Service_Field_Engineering_Brochure.pdf already uploaded!
06:18 ^🔗	SketchCow	[root@www /usr/home/jason/BITARCHIVERS]# for each in `grep beehive Date | cut -f3- -d" "`;do ./bitsaver "$each";done
06:19 ^🔗	hiker1	Does that upload to Archive.org?
06:20 ^🔗	SketchCow	Yes, that's what this is.
06:22 ^🔗	SketchCow	http://archive.org/details/bitsavers_beehive118isplayTerminalsforOEMPlanningBrochure_9464917
06:22 ^🔗	SketchCow	for example
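The log does not show how the bitsaver script decides that a file is "already uploaded". One simple way to get that behaviour, sketched here in Python under the assumption of a plain-text log of finished uploads (the function name, the log file, and the `upload` callback are all hypothetical):

```python
from pathlib import Path


def upload_once(name, log_path, upload):
    """Call upload(name) unless name is already recorded in log_path."""
    log = Path(log_path)
    done = set(log.read_text().splitlines()) if log.exists() else set()
    if name in done:
        print(f"[!] {name} already uploaded!")
        return False
    upload(name)  # caller-supplied function that does the real work
    with log.open("a") as handle:
        handle.write(name + "\n")  # record only after upload() returns
    return True
```

Because the name is recorded only after `upload` returns, a crashed upload is retried on the next run instead of being skipped.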
06:38 ^🔗	DFJustin	bits: fukken saved
06:40 ^🔗	SketchCow	yeah
06:40 ^🔗	SketchCow	Well, more that they're now up in the archive.org collection.
06:40 ^🔗	SketchCow	Now, remember the big second part to my doing this - pulling the item information into a wiki so people can edit it and we can sync it back to archive.org
06:41 ^🔗	SketchCow	That'll be the big thing
06:41 ^🔗	SketchCow	Then we can set it up for other collections.
06:41 ^🔗	SketchCow	total hack around internet archive not buying into collaborative metadata
06:42 ^🔗	SketchCow	http://archive.org/details/bitsavers_motorola68_1675238 is what we have for now
06:49 ^🔗	SketchCow	https://archive.org/details/bitsavers_atari40080mTextEditor1981_3442791
07:02 ^🔗	turnkit	love that updating solutions are being addressed.
07:02 ^🔗	SketchCow	has to be.
07:02 ^🔗	SketchCow	this is the year!
07:02 ^🔗	SketchCow	I am going to fix some shit. I know how things work now, and we're going to get cruising.
07:05 ^🔗	chronomex	fix the shit out of it this year
07:07 ^🔗	SketchCow	Yeah
07:07 ^🔗	SketchCow	And among that is fixing the metadata thing.
07:07 ^🔗	SketchCow	So I'll need help with that, as we make a second wiki for pulling in collections to work on
07:22 ^🔗	SketchCow	x-archive-meta-title:ibm :: 1130 :: subroutines :: 00.3.003 Magnetic Tape Subroutines For Assembler and Fortran Compiled Programs for the IBM 1130
07:22 ^🔗	SketchCow	Drool
07:23 ^🔗	chronomex	ooo
07:23 ^🔗	chronomex	now you're talking
07:24 ^🔗	hiker1	You guys saved all those old documents for atari?
07:25 ^🔗	hiker1	wow
07:39 ^🔗	godane	is trs-80 microcomputer news uploaded to archive.org?
07:42 ^🔗	godane	i only ask cause looks like romsheperd has it
08:41 ^🔗	Nemo_bis	Nice, SketchCow is keeping the OCR boxes busy. :)
08:54 ^🔗	godane	all of the free xbox magazines are uploaded now
08:54 ^🔗	hiker1	good work
09:04 ^🔗	Nemo_bis	godane: come on, only 835 texts uploads? You can do better. ;-) https://archive.org/search.php?query=uploader%3Aslaxemulator%20AND%20mediatype%3Atexts
09:05 ^🔗	godane	why is it that some of my website dumps are in texts?
09:05 ^🔗	godane	i know i upload it that way but jason moved it to archiveteam-file
09:06 ^🔗	godane	i do notice it doesn't have the archiveteam-file web interface
09:07 ^🔗	godane	SketchCow: i think you need to change the mediatype to some of my webdumps and iso files
09:08 ^🔗	godane	you only put into the collection
09:08 ^🔗	godane	without the mediatype change it doesn't get the collection web interface
09:10 ^🔗	godane	SketchCow: you also missed one of my groklaw.net pdfs dumps: https://archive.org/details/groklaw.net-pdfs-2007-20120827
09:10 ^🔗	godane	even though it's pdfs, they're in a warc.gz and tar.gz
09:11 ^🔗	godane	this one needs its mediatype changed: https://archive.org/details/TechTV_Computer_Basics_with_Chris_Pirillo_and_Kate_Botello
14:12 ^🔗	hiker1	Does anyone here use warc-proxy on Linux or OS x?
14:54 ^🔗	alard	hiker1: I use Linux.
14:55 ^🔗	hiker1	ah. I was just wondering how portable it was. Apparently very.
14:55 ^🔗	alard	You're using it on Windows? I find that even more interesting.
14:56 ^🔗	hiker1	heh. well, it works here too, with no problems
14:59 ^🔗	alard	Have you used the Firefox extension as well? Apparently OS X is (or was, perhaps) more difficult https://github.com/alard/warc-proxy/issues/1
14:59 ^🔗	hiker1	No, I didn't try it. My FireFox is so slow already ;_;
14:59 ^🔗	hiker1	plus I haven't restarted it in god knows how long
15:00 ^🔗	hiker1	That is an old ticket!
15:05 ^🔗	hiker1	alard: It probably doesn't like having the two extensions
15:05 ^🔗	alard	This was before the two extensions. At that time WARC *.warc.gz was the only file type (and there wasn't an All files option).
15:06 ^🔗	hiker1	I mean .warc.gz
15:06 ^🔗	hiker1	instead of .gz
15:06 ^🔗	alard	Ah, I see.
15:06 ^🔗	hiker1	That would be my guess
15:07 ^🔗	hiker1	Probably not much is lost by changing it to just .gz.
15:07 ^🔗	alard	But does the 'filename extension' even exist in OS X? I thought that was a Windows thing.
15:07 ^🔗	hiker1	I don't know
15:08 ^🔗	Smiley	hmmm
15:08 ^🔗	Smiley	is it like the linux version?
15:09 ^🔗	Smiley	If it exists, it obeys it (or complains about it) and if it doesn't exist, it doesn't care?
15:10 ^🔗	alard	Could this have something to do with it? https://bugzilla.mozilla.org/show_bug.cgi?id=444423
15:14 ^🔗	hiker1	might. would have to see the Firefox code for nsIFilePicker
15:15 ^🔗	hiker1	well, for the OS X code. that is just an interface
15:20 ^🔗	hiker1	That bug is probably related. I guess they'll be fixing it any time now. Yep, any year now. Maybe before the decade is out? Well, two decades?
15:28 ^🔗	godane	uploaded: http://archive.org/details/Call.For.Help.Canada.2004.08.17
15:29 ^🔗	godane	SketchCow: there is going to be a call for help canada collection soon in computers and tech videos
15:29 ^🔗	godane	i also plan on doing the same thing for all of the screen savers episodes i have
15:59 ^🔗	hiker1	What is call for help canada?
16:00 ^🔗	hiker1	tech tv show
16:00 ^🔗	hiker1	godane: How are you going to create the collection?
16:00 ^🔗	godane	i'm not creating the collection
16:01 ^🔗	godane	but jason scott puts my files into a collection
16:01 ^🔗	hiker1	ah
16:13 ^🔗	SketchCow	He lights a candle and I am there
16:17 ^🔗	godane	the screen savers have like the last 8 months of
16:17 ^🔗	hiker1	IIPC is working to create an archiving proxy: http://netpreserve.org/projects/live-archiving-http-proxy
16:18 ^🔗	godane	i'm thinking of something crazy that could in theory save space
16:18 ^🔗	hiker1	?
16:18 ^🔗	godane	like merging multiple warc.gz of the same site into sort of a megawarc
16:19 ^🔗	hiker1	won't help
16:19 ^🔗	godane	i was thinking of a way to dedup multiple warc.gz
16:19 ^🔗	hiker1	all you need to do is use actual compression instead of warc.gz files.
16:20 ^🔗	hiker1	warc.gz is not a true compressed file. It is a bunch of compressed files merged together. Each HTML file is compressed by itself without knowledge of the other html files.
16:20 ^🔗	alard	SketchCow: Semantic MediaWiki + Semantic Forms might be something to look at for your metadata-wiki.
16:20 ^🔗	godane	this way something like my 5 torrentfreak warc.gz can be in one mega warc but alot smaller
16:20 ^🔗	hiker1	This severely hurts compression
16:20 ^🔗	alard	You can make structured forms like these: http://hackerspaces.org/w/index.php?title=The_1st_SPACE_Retreat&action=formedit
16:21 ^🔗	hiker1	godane: Just extract a .warc.gz file, then compress it with .7z. Alternatively, extract a warc file and compress with 7-zip along with another .warc file
16:22 ^🔗	alard	Is compression that important?
16:22 ^🔗	godane	hiker1: i'm thinking something that can still work with wayback machine
16:22 ^🔗	hiker1	alard: Yes... especially for transferring files to IA.
16:22 ^🔗	hiker1	and it saves bandwidth for people downloading from IA
16:22 ^🔗	alard	But you're making your warc files much, much less useful.
16:23 ^🔗	godane	i'm also thinking of some sort of way so IA could have a kit to set up archive.org at home of sorts
16:23 ^🔗	godane	dedup matters when you think in that way
16:23 ^🔗	hiker1	Does IA even use the original warc.gz file in production? I assume they use it to feed the machine, but then they could have just extracted a 7z and fed the machine with that
16:24 ^🔗	alard	I don't know. The wayback machine that you can download from GitHub certainly reads from .warc.gz files.
16:25 ^🔗	hiker1	but I'm guessing IA has two copies
16:25 ^🔗	hiker1	one as the Item, and one for the machine
16:25 ^🔗	hiker1	I am not sure though, it's just a guess
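The trade-off in this exchange can be seen with Python's standard `gzip` module. The sketch below compresses a set of made-up, near-identical records once per record (the way a .warc.gz is laid out, per hiker1's description) and once as a single stream. The records are invented for illustration and are not real WARC records.

```python
import gzip


def per_record_gzip(records):
    """Compress each record as its own gzip member, as .warc.gz does."""
    return b"".join(gzip.compress(record) for record in records)


def solid_gzip(records):
    """Compress all records as one stream (no per-record access)."""
    return gzip.compress(b"".join(records))


# Made-up records sharing a lot of boilerplate, like pages of one site.
records = [
    b"<html><head><title>page %d</title></head>"
    b"<body>same footer</body></html>" % i
    for i in range(200)
]
members = per_record_gzip(records)
solid = solid_gzip(records)

# Concatenated gzip members still form one valid gzip file.
assert gzip.decompress(members) == b"".join(records)
print(len(members), len(solid))
```

The single stream comes out much smaller because the compressor sees the repetition across records. The per-record layout keeps each record independently readable from its byte offset, which is presumably why alard calls the recompressed files less useful.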
16:28 ^🔗	hiker1	For a continuous crawler, you could save space by checking old records in a WARC file and then adding revisit records as necessary. But this would not work with any current WARC readers (wayback machine, warc-proxy, etc.)
16:29 ^🔗	hiker1	continuous crawler being e.g. one that crawls every week to check for changes to the site
16:31 ^🔗	hiker1	godane: If you have the WARC ISO draft, it discusses this type of example on p. 27
16:39 ^🔗	godane	hiker1: again i was thinking of a way to dedup multiple warc into one big warc
16:39 ^🔗	godane	not revisiting records in another file
16:41 ^🔗	godane	the idea is to store the file once
16:41 ^🔗	hiker1	godane: If you had two snapshots, warc1 taken first then warc2. You could run through warc2 and see if the HTTP body matches the record in warc1. If it matches, append a revisit record to warc1. If they are different, append the new record to warc1.
16:42 ^🔗	hiker1	godane: Two snapshots of the same site, right?
16:42 ^🔗	hiker1	or do you mean two warc files with different contents, e.g. html in one and images in another?
16:43 ^🔗	godane	i mean content with the same md5sum or checksum
16:44 ^🔗	hiker1	Yes, as I said, run through the two and check if the http body is the same
16:44 ^🔗	hiker1	WARC files already offer the payload-digest to check if the http body is the same.
16:44 ^🔗	godane	again i want the two to merge into one file
16:45 ^🔗	godane	also say that the site is dead
16:45 ^🔗	hiker1	that would merge them effectively
16:45 ^🔗	hiker1	after you did it, you could delete warc2.
16:45 ^🔗	hiker1	since all the contents would be in warc1
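hiker1's merge procedure, sketched in Python on a deliberately simplified model. Records here are plain dicts with `url`, `date`, and `body` keys, a stand-in for real WARC response records; no WARC library is used, and the output field names are hypothetical. The digest imitates the style of the WARC payload digest (SHA-1, base32).

```python
import base64
import hashlib


def payload_digest(body):
    """SHA-1 of the HTTP body, base32-encoded, prefixed with 'sha1:'."""
    raw = hashlib.sha1(body).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def merge_snapshots(warc1, warc2):
    """Merge two snapshots, storing each distinct body only once.

    A body seen before becomes a 'revisit' entry that points at the
    first record holding it; a new body is kept as a full 'response'.
    """
    merged = []
    first_seen = {}  # digest -> (url, date) of the record with the body
    for record in list(warc1) + list(warc2):
        digest = payload_digest(record["body"])
        entry = {"url": record["url"], "date": record["date"], "digest": digest}
        if digest in first_seen:
            entry["type"] = "revisit"
            entry["refers_to"] = first_seen[digest]
        else:
            entry["type"] = "response"
            entry["body"] = record["body"]
            first_seen[digest] = (record["url"], record["date"])
        merged.append(entry)
    return merged
```

After the merge, every body from the second snapshot is either stored or referenced in the result, so the second file can be deleted as described. As noted earlier in the channel, readers of that era may not follow such revisit records.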
17:58 ^🔗	SketchCow	Just had a fascinating conversation with brewster about internet archive.
17:58 ^🔗	SketchCow	Talking about fundraising, still on his mind, how can that be done better next year.
18:00 ^🔗	SketchCow	one of the things I mentioned was Internet Archive taking over or bringing back certain "services" people have expected over the years.
18:00 ^🔗	SketchCow	So have a bank of virtual servers that are basically "this service"
18:00 ^🔗	SketchCow	brainstorming on that would be good at some point.
18:00 ^🔗	SketchCow	So basically, he's up for non-archive.org-interface things
18:02 ^🔗	SketchCow	Also, I am fucking DESTROYING the submit queue
18:02 ^🔗	SketchCow	\o/
18:05 ^🔗	*	Nemo_bis feels the derivers gasping
18:05 ^🔗	*	Nemo_bis laughs devilishly
18:14 ^🔗	SketchCow	I can already see a few dozen, maybe a couple hundred, will come out 'wrong'.
18:17 ^🔗	Nemo_bis	SketchCow: my meccano-magazine-* derives always failed when rerun all together. I had to rerun only ~100 at a time to avoid them timing out on solr_post.py or getting stuck with high load on OCR hosts.
18:18 ^🔗	SketchCow	Right.
18:18 ^🔗	SketchCow	We should make those a collection, huh.
18:18 ^🔗	SketchCow	I mean, it IS 650 issues
18:21 ^🔗	SketchCow	http://archive.org/details/meccano_magazine
18:22 ^🔗	beardicus	SketchCow, do you have an example service re: your talk with brewster?
18:25 ^🔗	beardicus	ah... maybe i see what you're getting at. eg: provide gopher archives through an actual gopher server instead of all webbed up in the archive.org interface.
18:31 ^🔗	SketchCow	http://archive.org/details/meccano_magazine is now coming along nicely.
18:31 ^🔗	SketchCow	Yes
18:31 ^🔗	SketchCow	That is exactly what I mean
18:31 ^🔗	SketchCow	"stuff"
19:03 ^🔗	hiker1	How is chunked HTTP encoding supposed to be handled in a WARC file?
19:03 ^🔗	hiker1	Should I just remove the chunked header from the response?
19:18 ^🔗	hiker1	alard: warc-proxy passes the transfer-encoding header. This seems to leave the connection open forever.
19:19 ^🔗	hiker1	for responses that have it set
19:28 ^🔗	hiker1	I think I might be saving my chunks wrong.
19:40 ^🔗	hiker1	No, I think I am saving them right. Hanzo warctools handles decoding the chunks, so I don't think warc-proxy should pass the transfer-encoding header since that would be telling the browser to handle the chunks.
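A minimal sketch of the point hiker1 arrives at: once the chunks have been decoded, the replayed response should no longer advertise `Transfer-Encoding: chunked`. This is not warc-proxy's or Hanzo warctools' actual code; both function names are hypothetical, and trailer headers after the last chunk are ignored.

```python
def decode_chunked(raw):
    """Decode an HTTP/1.1 chunked body into the plain payload bytes."""
    body = bytearray()
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        # The size line is hex, optionally followed by ";extension".
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; any trailer headers are ignored here
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


def replay_headers(headers, body):
    """Drop Transfer-Encoding and set Content-Length for a decoded body."""
    dropped = ("transfer-encoding", "content-length")
    kept = [(k, v) for k, v in headers if k.lower() not in dropped]
    kept.append(("Content-Length", str(len(body))))
    return kept
```

With an explicit `Content-Length`, the browser knows where the body ends, so the connection is not left waiting for chunks that never arrive.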
19:50 ^🔗	swebb	That Adobe CS2 stuff can be found here: http://noelchenier.blogspot.ca/2013/01/download-adobe-cs2-for-free.html
20:00 ^🔗	SketchCow	Thanks.
20:00 ^🔗	SketchCow	Grabbing.
20:05 ^🔗	SketchCow	Up past 1,500 red rows on archive.org!
20:05 ^🔗	SketchCow	Deriver is dying on me
20:05 ^🔗	SketchCow	TAKE IT
20:05 ^🔗	SketchCow	TAKE ALL OF IT
20:06 ^🔗	Smiley	MOAR FATA!!!!
20:15 ^🔗	SketchCow	I had to stop it.
20:15 ^🔗	SketchCow	It's at 1,480.
20:16 ^🔗	SketchCow	I'll let things die down and do another massive submit after these get down
20:16 ^🔗	SketchCow	Or it'll murder a dev
20:45 ^🔗	alard	hiker1: Are you certain you have the latest warc-proxy? The latest version shouldn't send the Transfer-Encoding header: https://github.com/alard/warc-proxy/blob/master/warcproxy.py#L263
21:09 ^🔗	Nemo_bis	SketchCow: I so much miss having a ganglia graph for all servers to see the CPU load.
21:12 ^🔗	Nemo_bis	SketchCow: some meccano-magazine-* were not added to the collection (like half of them?), in case you don't know
21:13 ^🔗	Nemo_bis	eg https://archive.org/catalog.php?history=1&identifier=meccano-magazine-1966-03
21:29 ^🔗	godane	i got one of my smart computing magazines in the mail
21:29 ^🔗	godane	there is some highlighting and writing on the front
21:29 ^🔗	godane	but that can be fixed in gimp
21:31 ^🔗	Nemo_bis	omg godane's scans are RETOUCHED
21:34 ^🔗	godane	the original will be posted too
21:34 ^🔗	Nemo_bis	j/k
21:34 ^🔗	godane	the cover mostly had the writing on the front
21:34 ^🔗	godane	where it's white
21:36 ^🔗	godane	omg: http://www.ebay.com/sch/Magazine-Back-Issues-/280/i.html?_ipg=25&_from=&_nkw=&_armrs=1&_ssn=treasures-again
21:37 ^🔗	godane	who is willing to give me money for archiving this?
21:38 ^🔗	godane	just know that all these items have about 3 days or less
21:40 ^🔗	DFJustin	is there any point in paying full price for this stuff on ebay, you can probably get stacks at your local salvation army for 25c a pop
21:42 ^🔗	godane	i'm in new england i don't know if local salvation army will have this stuff?
21:44 ^🔗	DFJustin	your local university library probably has them then
21:46 ^🔗	DFJustin	at least you should check places before paying $7 an issue for stuff that's not super old and rare
21:46 ^🔗	godane	DFJustin: i don't drive
21:47 ^🔗	godane	my brother is the one that drives
21:47 ^🔗	Nemo_bis	there's also surely someone with the cellar full of thousands of magazines if one can drive to them and collect them
21:47 ^🔗	Nemo_bis	but you don't, too bad
22:26 ^🔗	DFJustin	http://www.forbes.com/sites/adriankingsleyhughes/2013/01/07/download-adobe-cs2-applications-for-free/
22:29 ^🔗	DFJustin	* not actually free
22:35 ^🔗	db48x	lol
23:06 ^🔗	apokalypt	http://windowsphone.bboard.de/board/
