#archiveteam 2012-09-20,Thu

Time	Nickname	Message
00:22 ^🔗	dashcloud	fellow archivers, what's the best way to label a DVD ?
00:24 ^🔗	chronomex	write on paper envelope
00:33 ^🔗	SketchCow	Don't make it
00:48 ^🔗	SketchCow	People shouldn't be making DVDs at this late stage.
00:55 ^🔗	SketchCow	Buy a couple drives and get a place to store offsite
01:06 ^🔗	chronomex	^ yup
01:14 ^🔗	godane	i burn to bluray
01:14 ^🔗	godane	but i also have my data still on a drive
01:14 ^🔗	godane	this is only cause i'm poor
01:19 ^🔗	SketchCow	You'll be poorer when you lose your crap
01:19 ^🔗	SketchCow	Two 2gb drives: $200
01:20 ^🔗	underscor	s/g/t/
01:21 ^🔗	SketchCow	Shhhh
01:22 ^🔗	SketchCow	Dude, I almost had the cash
02:11 ^🔗	dashcloud	SketchCow: the reason I'm making DVDs is I've still got a considerable collection of commercial VHS tapes lying around, and I'm converting them to digital format slowly
02:40 ^🔗	SketchCow	.AVI
02:41 ^🔗	SketchCow	I love all you rascals but I'm not going to be convinced here.
02:41 ^🔗	Coderjoe	.avi is like .tar for video.
02:42 ^🔗	Coderjoe	so there is still the issue of codecs to use
02:42 ^🔗	Coderjoe	(just throwing information in the ring.)
02:45 ^🔗	Coderjoe	just don't use things like cinepak or indeo :D
02:52 ^🔗	underscor	.mkv all the way
02:52 ^🔗	underscor	.mkv and h264
02:55 ^🔗	DFJustin	mp4 is more widely supported than mkv
03:00 ^🔗	underscor	yeah but I like mkv as a container
03:04 ^🔗	dashcloud	as long as you don't invent new fourccs for existing codecs or stick h264 into avi, we can be friends
03:09 ^🔗	chronomex	then you'll hate being my roommate
06:49 ^🔗	Dan68	Any archived sites that'd be fun to mirror on a box not connected to the internet but accessible to a large group of people?
06:51 ^🔗	Dan68	I've got about 200gb of extra space to throw at it, maybe a few news sites like the montreal mirror or something similar
08:11 ^🔗	ersi	Dan68: You.. uh.. what?
08:11 ^🔗	ersi	I dunno, like.. Wikipedia? TVTropes? That might be fun for a bunch of people not connected to net I guess..
08:12 ^🔗	Dan68	eh
08:14 ^🔗	Dan68	Already am mirroring wikipedia (text only), if I had an extra hdd that was large enough I would grab a copy of the geocities archive
08:20 ^🔗	godane	this has to be changed to text section: http://archive.org/details/groklaw.net-pdfs-2007-20120827
08:20 ^🔗	godane	i put in video by mistake
11:41 ^🔗	tef_	chronomex: hrm?
11:42 ^🔗	tef_	chronomex: what are you writing the warc proxy in ?
12:17 ^🔗	tef_	chronomex, alard fwiw I've mirrored warctools on github https://github.com/tef/warctools
13:49 ^🔗	ersi	tef_: cool! :)
13:59 ^🔗	tef_	unfortunately work has informed me I can't add a warc writing proxy to warctools, but paradoxically, I can accept push requests containing it
13:59 ^🔗	tef_	head explodes
13:59 ^🔗	tef_	but i have permission to add a different proxy, and http bits to it, and all the component parts
14:00 ^🔗	tef_	it's as if I am allowed to commit '2' but not '2+2' because 4 is our IP
15:03 ^🔗	SketchCow	Converting the month to a number.
15:03 ^🔗	SketchCow	Here's what I plan to do.
15:03 ^🔗	SketchCow	I will add an item called cuamiga-magazine-104.
15:03 ^🔗	SketchCow	In the collection named cuamiga-magazine...
15:03 ^🔗	SketchCow	OK, then, CUAmiga_104_Oct_1998.pdf gets the love.
15:03 ^🔗	SketchCow	I will say this dates to 1998-10.
15:03 ^🔗	SketchCow	I will give it the title of CU Amiga Magazine Issue 104.
15:03 ^🔗	SketchCow	The ingestor's back!
15:22 ^🔗	SmileyG	s/[jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec]/[01,02,03,04,05,06,07,08,09,10,11,12]/g or something
15:25 ^🔗	SketchCow	That's what it does, yes.
15:25 ^🔗	SketchCow	xmas to 12
15:25 ^🔗	SketchCow	:)
15:26 ^🔗	*	SmileyG turns the month up to 13
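
The filename-to-metadata plan above (CUAmiga_104_Oct_1998.pdf becoming item cuamiga-magazine-104, dated 1998-10) can be sketched in Python. This is only an illustration, not the actual ingestor: the filename pattern and the names `MONTHS`, `FILENAME_RE`, and `item_metadata` are assumptions made for the example. Note that the sed expression quoted above would not work as typed, since `[jan,feb,...]` is a bracket expression matching single characters, so a lookup table is used instead.

```python
import re

# Three-letter month abbreviations mapped to two-digit month numbers.
MONTHS = {
    "jan": "01", "feb": "02", "mar": "03", "apr": "04",
    "may": "05", "jun": "06", "jul": "07", "aug": "08",
    "sep": "09", "oct": "10", "nov": "11", "dec": "12",
}

# Assumed filename layout: <Name>_<issue>_<Mon>_<year>.pdf
FILENAME_RE = re.compile(
    r"^(?P<name>[A-Za-z0-9]+)_(?P<issue>\d+)_(?P<month>[A-Za-z]{3})_(?P<year>\d{4})\.pdf$"
)


def item_metadata(filename, collection, title_prefix):
    """Derive identifier, title, and date for one magazine issue (hypothetical helper)."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    month = MONTHS.get(match.group("month").lower())
    if month is None:
        raise ValueError(f"unknown month in: {filename!r}")
    issue = match.group("issue")
    return {
        "identifier": f"{collection}-{issue}",
        "collection": collection,
        "title": f"{title_prefix} Issue {issue}",
        "date": f"{match.group('year')}-{month}",
    }


if __name__ == "__main__":
    print(item_metadata("CUAmiga_104_Oct_1998.pdf",
                        "cuamiga-magazine", "CU Amiga Magazine"))
```

Filenames that do not fit the assumed layout raise `ValueError` rather than producing a guessed date.
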
15:26 ^🔗	SketchCow	Looks like roughly 300 magazine issues from 5 titles are going in.
15:36 ^🔗	SketchCow	330
15:36 ^🔗	SketchCow	Excellent.
15:36 ^🔗	SketchCow	And 8 titles
15:39 ^🔗	SmileyG	Warning. Mr Burns type monolog detected ;)
15:39 ^🔗	underscor	:D
16:20 ^🔗	SketchCow	http://archive.org/details/page6-magazine
18:59 ^🔗	SketchCow	About to shove 33gb, 128 issues of BYTE into archive.org.
19:00 ^🔗	SketchCow	I've already flooded/broken the incoming stream, the thing that Stops Jason From Adding So Much Shit kicked in and blocked me
19:00 ^🔗	SketchCow	So let's triple it!!!!!!
19:06 ^🔗	Lord_Nigh	SketchCow: btw how is the archive.org raid set up? is it raid-5 or raid-6?
19:06 ^🔗	DFJustin	LIMITER DISENGAGE
19:07 ^🔗	Lord_Nigh	(i assume the latter given this is archival)
19:10 ^🔗	SketchCow	I am not qualified to discuss how the archive.org systems are set up.
19:11 ^🔗	SketchCow	DFJustin: The limiter only applies to me and only engages when I and others cause a 200+ document backup or something
19:13 ^🔗	Lord_Nigh	maybe it's raid-12: a mirrored array of raid-6 drives
19:13 ^🔗	brayden	How many disks can fail before you lose data?
19:13 ^🔗	Lord_Nigh	raid-12? as long as you have no common disks fail in both arrays, two in each array
19:14 ^🔗	Lord_Nigh	raid5 can survive one disk failing, raid6 can survive two
19:14 ^🔗	underscor	We do not have raid.
19:14 ^🔗	brayden	Thinking maybe SketchCow got told this once and can at least disclose that.
19:14 ^🔗	brayden	wat.
19:14 ^🔗	underscor	There are two copies of each file, in separate datacenters
19:14 ^🔗	brayden	fair enough
19:14 ^🔗	brayden	some sort of rsync like thing going on I guess?
19:14 ^🔗	Lord_Nigh	ah so it's logical raid/rsync
19:14 ^🔗	Lord_Nigh	file-level "raid"
19:15 ^🔗	Lord_Nigh	like those hp NASes use
19:15 ^🔗	underscor	Yeah.
19:15 ^🔗	underscor	That's what "bup" tasks are
19:15 ^🔗	underscor	Rsync from primary (ia6) to secondary (ia7)
19:16 ^🔗	Lord_Nigh	you guys need a 3rd datacenter drilled into granite bedrock in sweden or something
19:16 ^🔗	Lord_Nigh	or maybe in one of those nuke-proof missile silos
19:17 ^🔗	DopefishJ	what's the bandwidth like in svalbard
19:17 ^🔗	soultcer	Doesn't xs4all in the Netherlands host a partial copy, and the Library of Alexandria another one?
19:18 ^🔗	balrog_	btw, does dr dobbs journal need to be scanned?
19:18 ^🔗	SketchCow	I'd rather something more obscure be scanned.
19:19 ^🔗	SketchCow	Something the world is more in danger of losing.
19:19 ^🔗	balrog_	hm... software manuals? :)
19:19 ^🔗	SketchCow	Yes, things like that.
19:19 ^🔗	Lord_Nigh	the source code to ms-basic-80
19:19 ^🔗	SketchCow	bitsavers is doing what they can
19:20 ^🔗	Lord_Nigh	iirc mit or someone else has a printed copy of the ms-basic-80 src but microsoft forbids them from duplicating it
19:20 ^🔗	Lord_Nigh	you're allowed to LOOK at it though, but that's it
19:20 ^🔗	balrog_	which version is basic-80? I know several people have the source to the 6502 BASIC
19:20 ^🔗	balrog_	and afaik at least one version is out there
19:20 ^🔗	Lord_Nigh	basic-80 is the 8080 altair basic
19:20 ^🔗	Lord_Nigh	this is the source code used to assemble it
19:22 ^🔗	Lord_Nigh	what i really want to get a copy of is the fortran or C source code and prosodic/morphemic data files from mitalk
19:22 ^🔗	Lord_Nigh	that's a majorly important part of speech synthesis history
19:23 ^🔗	SketchCow	http://archive.org/details/thingiverse-20110829
19:23 ^🔗	Lord_Nigh	it was not the first software speech synthesizer engine but it was the first modern one which actually used language parts etc to decompose words and not simple letter to sound rules like mcilroy's algorithm and the NRL algorithm
19:23 ^🔗	SketchCow	We're due another one. Can someone do that?
19:23 ^🔗	Lord_Nigh	thingiverse has lost hundreds if not thousands of designs just today from people deleting them
19:24 ^🔗	SketchCow	That's fine.
19:24 ^🔗	SketchCow	We're due another one. Can someone do that?
19:25 ^🔗	Lord_Nigh	not enough space here :(
19:26 ^🔗	SketchCow	Someone did an excellent download before
19:26 ^🔗	SketchCow	And can do it again
19:36 ^🔗	alard	SketchCow: I think I have the thingiverse scripts here. (And they're also on Github, of course: https://github.com/ArchiveTeam/scrapy-thingy ) But is that still the best way to do it? It doesn't produce warcs.
19:37 ^🔗	SketchCow	For this, yes.
19:38 ^🔗	SketchCow	I just want it before it's all gone
19:38 ^🔗	SketchCow	Before this assholery deletes all of it
19:38 ^🔗	SketchCow	Which one do I want? i'll just run it.
19:38 ^🔗	SketchCow	Got it, understand it now.
19:39 ^🔗	SketchCow	all_things calls thingy and all_users calls usery
19:39 ^🔗	alard	Yes, I think that was the idea.
19:40 ^🔗	alard	I wrote it downloads "recent additions", so I assume it can do incremental updates if you want.
19:43 ^🔗	alard	underscor: Are you downloading the sony forums? (Just checking.)
19:43 ^🔗	chronomex	tef_: I'm dabbling with various python http-proxy things. haven't yet found one that works to my satisfaction.
19:45 ^🔗	underscor	alard: as far as I know it's running
19:45 ^🔗	underscor	I can't ssh from this access point (firewalled), but I'll get a status soon
19:47 ^🔗	alard	Ah, good.
19:49 ^🔗	SketchCow	Downloading the thingies
20:36 ^🔗	Nemo_bis	SketchCow: can you add to wikiteam? http://archive.org/details/wikitweets
20:36 ^🔗	Nemo_bis	unless there's a twitter collection which is more relevant
21:02 ^🔗	SketchCow	500 of 30,000 things now downloaded.
22:39 ^🔗	tef_	chronomex: mitmproxy ?
22:42 ^🔗	chronomex	hm, haven't looked into that
22:43 ^🔗	chronomex	I was looking at the simplest proxies I could, to make sure that nothing was getting modified or cached
22:43 ^🔗	chronomex	but then I wound up with a proxy that couldn't handle running a real browser through it
22:43 ^🔗	chronomex	like it would send everything to the first host it connected to
22:48 ^🔗	soultcer	chronomex: How about using ICAP: https://en.wikipedia.org/wiki/Internet_Content_Adaptation_Protocol
22:48 ^🔗	soultcer	http://icap-server.sourceforge.net/ seems to provide a client too.
22:49 ^🔗	chronomex	huh
22:49 ^🔗	chronomex	what exactly would this do for me?
22:50 ^🔗	soultcer	It is a protocol that allows a web proxy to hand off "content filtering, ad insertion" to an icap server
22:50 ^🔗	soultcer	But instead of content filtering, you could simply write all data you receive into a warc file and return the original data to the proxy
22:50 ^🔗	chronomex	hmm
22:50 ^🔗	chronomex	interesting
22:51 ^🔗	soultcer	Since icap is supported by multiple mature proxy implementations, it might be an alternative to finding a working proxy implementation in python
22:51 ^🔗	chronomex	do you know if headers are propagated to icap?
22:52 ^🔗	soultcer	To be honest, I have no idea how ICAP works in detail, but I thought it might work for you
22:52 ^🔗	chronomex	it's something to look into, to be sure
22:54 ^🔗	DrainLbry	sketchcow: high fucking five on the thingiverse grab
22:54 ^🔗	DrainLbry	i've been worrying about the state of 3d objects lately, that shits gonna be regulated, DRM'd, and productized so fast
22:54 ^🔗	chronomex	soultcer: okay, it looks like the proxy munges HTTP headers into ICAP headers.
22:55 ^🔗	soultcer	The question is, does it allow you to match up requests with responses? Or does warc not contain the request headers?
22:55 ^🔗	chronomex	orrr, I was wrong - https://tools.ietf.org/html/rfc3507#page-18
22:55 ^🔗	chronomex	warc has request and response
22:55 ^🔗	chronomex	they're tied together with a uuid
22:55 ^🔗	chronomex	req has a uuid and a link to the uuid of the resp, and vice versa
22:56 ^🔗	chronomex	just use zless on a .warc.gz, it's human-readable
22:56 ^🔗	soultcer	Oh, I thought only arc files were human-readable
22:56 ^🔗	chronomex	nope :)
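
As a rough illustration of the request/response linking described above, the Python sketch below reads the named header fields of each record in a .warc.gz and shows how the `WARC-Record-ID` and `WARC-Concurrent-To` fields point at each other. It is a minimal reader for demonstration only: it ignores header continuation lines, and the sample records (built in memory with made-up IDs by the hypothetical `sample_warc_gz` helper) omit fields a real WARC would carry, such as `WARC-Date`.

```python
import gzip
import io


def iter_warc_headers(stream):
    """Yield one dict of named header fields per WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record block (the captured HTTP request or response).
        stream.read(int(headers.get("Content-Length", "0")))
        yield headers


def links(headers_iter):
    """Return (type, record id, concurrent-to id) for each record."""
    return [(h.get("WARC-Type"), h.get("WARC-Record-ID"),
             h.get("WARC-Concurrent-To")) for h in headers_iter]


def sample_warc_gz():
    """Build a tiny two-record .warc.gz in memory (made-up IDs, simplified headers)."""
    req_id = "<urn:uuid:00000000-0000-0000-0000-000000000001>"
    resp_id = "<urn:uuid:00000000-0000-0000-0000-000000000002>"

    def record(rtype, record_id, other_id, block):
        head = (
            "WARC/1.0\r\n"
            f"WARC-Type: {rtype}\r\n"
            f"WARC-Record-ID: {record_id}\r\n"
            f"WARC-Concurrent-To: {other_id}\r\n"
            f"Content-Length: {len(block)}\r\n"
            "\r\n"
        ).encode("utf-8")
        return head + block + b"\r\n\r\n"

    raw = (
        record("request", req_id, resp_id,
               b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
        + record("response", resp_id, req_id,
                 b"HTTP/1.1 200 OK\r\n\r\nhello")
    )
    return gzip.compress(raw)


if __name__ == "__main__":
    with gzip.open(io.BytesIO(sample_warc_gz()), "rb") as stream:
        for rtype, record_id, other_id in links(iter_warc_headers(stream)):
            print(rtype, record_id, "->", other_id)
```

To try it on a real file, pass a path to `gzip.open` instead of the in-memory sample.
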
23:21 ^🔗	SketchCow	Bre Pettis thanked us when we grabbed it last summer.
23:27 ^🔗	chronomex	soultcer, tef_: part of the reason I picked up this thread of development yesterday was my discovery that `wget --page-requisites` doesn't even fetch urls from <script src=
23:27 ^🔗	chronomex	which kind of seems wrong ...
23:28 ^🔗	soultcer	Yeah, I wrote my own "mirror wget" in ruby after I discovered that, though I haven't updated it to use warc files
23:29 ^🔗	chronomex	my final solution will consist of some kind of headless browser that does js, and a warc-writing proxy.
23:30 ^🔗	soultcer	You could write a firefox plugin that writes warcs :-)
23:30 ^🔗	chronomex	oh dear
23:30 ^🔗	chronomex	I don't even use firefox!
23:31 ^🔗	soultcer	Haha Internet Explorer?
23:32 ^🔗	chronomex	opera for most things, chrome for things that don't work in opera and for flash
23:36 ^🔗	DFJustin	submit a patch for wget so it does include scripts?
23:36 ^🔗	chronomex	but do you expect it to run scripts and get things that the script asks for?
23:37 ^🔗	chronomex	I think they drew a line and decided that they were fine with not fetching any js
23:37 ^🔗	DFJustin	I don't but it seems silly to leave something obvious like that on the floor
23:37 ^🔗	*	chronomex shrugs
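
For the narrower case DFJustin describes (fetching the files named in `<script src=...>` without running them), collecting the URLs is straightforward. The Python sketch below uses only the standard library; `ScriptSrcParser` and `script_urls` are names invented for this example, it has nothing to do with how wget works internally, and it will not find anything a script loads at run time.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptSrcParser(HTMLParser):
    """Collect absolute URLs from <script src="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, src))


def script_urls(html, base_url):
    """Return the script URLs referenced by an HTML document, in order."""
    parser = ScriptSrcParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<script src="/js/app.js"></script>'
        '<script>var inline = 1;</script>'
        '<script src="http://cdn.example.com/lib.js"></script>'
        '</head><body></body></html>'
    )
    for url in script_urls(page, "http://example.com/page/index.html"):
        print(url)
```

Inline scripts without a `src` attribute are skipped, which matches the "don't run anything" line drawn in the discussion.
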
