[00:22] fellow archivers, what's the best way to label a DVD?
[00:24] write on paper envelope
[00:33] Don't make it
[00:48] People shouldn't be making DVDs at this late stage.
[00:55] Buy a couple drives and get a place to store offsite
[01:06] ^ yup
[01:14] i burn to bluray
[01:14] but i also have my data still on a drive
[01:14] this is only cause i'm poor
[01:19] You'll be poorer when you lose your crap
[01:19] Two 2gb drives: $200
[01:20] s/g/t/
[01:21] Shhhh
[01:22] Dude, I almost had the cash
[02:11] SketchCow: the reason I'm making DVDs is I've still got a considerable collection of commercial VHS tapes lying around, and I'm converting them to digital format slowly
[02:40] .AVI
[02:41] I love all you rascals but I'm not going to be convinced here.
[02:41] .avi is like .tar for video.
[02:42] so there is still the issue of codecs to use
[02:42] (just throwing information in the ring.)
[02:45] just don't use things like cinepak or indeo :D
[02:52] .mkv all the way
[02:52] .mkv and h264
[02:55] mp4 is more widely supported than mkv
[03:00] yeah but I like mkv as a container
[03:04] as long as you don't invent new fourccs for existing codecs or stick h264 into avi, we can be friends
[03:09] then you'll hate being my roommate
[06:49] Any archived sites that'd be fun to mirror on a box not connected to the internet but accessible to a large group of people?
[06:51] I've got about 200gb of extra space to throw at it, maybe a few news sites like the Montreal Mirror or something similar
[08:11] Dan68: You.. uh.. what?
[08:11] I dunno, like.. Wikipedia? TVTropes? That might be fun for a bunch of people not connected to the net I guess..
[08:12] eh
[08:14] Already am mirroring Wikipedia (text only); if I had an extra hdd that was large enough I would grab a copy of the GeoCities archive
[08:20] this has to be changed to the texts section: http://archive.org/details/groklaw.net-pdfs-2007-20120827
[08:20] i put it in video by mistake
[11:41] chronomex: hrm?
[11:42] chronomex: what are you writing the warc proxy in?
[12:17] chronomex, alard: fwiw I've mirrored warctools on github: https://github.com/tef/warctools
[13:49] tef_: cool! :)
[13:59] unfortunately work has informed me I can't add a warc-writing proxy to warctools, but paradoxically, I can accept pull requests containing it
[13:59] *head explodes*
[13:59] but i have permission to add a different proxy, and http bits to it, and all the component parts
[14:00] it's as if I am allowed to commit '2' but not '2+2' because 4 is our IP
[15:03] Converting the month to a number.
[15:03] Here's what I plan to do.
[15:03] I will add an item called cuamiga-magazine-104.
[15:03] In the collection named cuamiga-magazine...
[15:03] OK, then, CUAmiga_104_Oct_1998.pdf gets the love.
[15:03] I will say this dates to 1998-10.
[15:03] I will give it the title of CU Amiga Magazine Issue 104.
[15:03] The ingestor's back!
[15:22] s/[jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec]/[01,02,03,04,05,06,07,08,09,10,11,12]/g or something
[15:25] That's what it does, yes.
[15:25] xma to 12
[15:25] :)
[15:26] * SmileyG turns the month up to 13
[15:26] Looks like roughly 300 magazine issues from 5 titles are going in.
[15:36] 330
[15:36] Excellent.
[15:36] And 8 titles
[15:39] Warning. Mr Burns type monologue detected ;)
[15:39] :D
[16:20] http://archive.org/details/page6-magazine
[18:59] About to shove 33gb, 128 issues of BYTE into archive.org.
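(A minimal sketch of the month-name-to-number step the ingestor discussion above describes. The filename pattern and the parse_issue helper are assumptions reconstructed from the CUAmiga_104_Oct_1998.pdf example; the real ingestor may work quite differently.)

    import re

    # Map three-letter month names to zero-padded numbers, as in the
    # s/[jan,...]/[01,...]/g quip above.
    MONTHS = {m: "%02d" % i for i, m in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

    def parse_issue(filename):
        """Turn e.g. 'CUAmiga_104_Oct_1998.pdf' into item metadata."""
        m = re.match(r"([A-Za-z]+)_(\d+)_([A-Za-z]{3})_(\d{4})\.pdf$", filename)
        if not m:
            raise ValueError("unrecognised filename: " + filename)
        title, issue, month, year = m.groups()
        return {
            "identifier": "%s-magazine-%s" % (title.lower(), issue),
            "date": "%s-%s" % (year, MONTHS[month.lower()]),
            "title": "%s Magazine Issue %s" % (title, issue),
        }

    print(parse_issue("CUAmiga_104_Oct_1998.pdf"))
    # {'identifier': 'cuamiga-magazine-104', 'date': '1998-10',
    #  'title': 'CUAmiga Magazine Issue 104'}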
[19:00] I've already flooded/broken the incoming stream; the thing that Stops Jason From Adding So Much Shit kicked in and blocked me
[19:00] So let's triple it!!!!!!
[19:06] SketchCow: btw how is the archive.org raid set up? is it raid-5 or raid-6?
[19:06] LIMITER DISENGAGE
[19:07] (i assume the latter given this is archival)
[19:10] I am not qualified to discuss how the archive.org systems are set up.
[19:11] DFJustin: The limiter only applies to me and only engages when I and others cause a 200+ document backup or something
[19:13] maybe it's raid-12: a mirrored array of raid-6 drives
[19:13] How many disks can fail before you lose data?
[19:13] raid-12? as long as no common disks fail in both arrays, two in each array
[19:14] raid5 can survive one disk failing, raid6 can survive two
[19:14] We do not have raid.
[19:14] Thinking maybe SketchCow got told this once and can at least disclose that.
[19:14] wat.
[19:14] There are two copies of each file, in separate datacenters
[19:14] fair enough
[19:14] some sort of rsync-like thing going on I guess?
[19:14] ah so it's logical raid/rsync
[19:14] file-level "raid"
[19:15] like those HP NASes use
[19:15] Yeah.
[19:15] That's what "bup" tasks are
[19:15] Rsync from primary (ia6) to secondary (ia7)
[19:16] you guys need a 3rd datacenter drilled into granite bedrock in sweden or something
[19:16] or maybe in one of those nuke-proof missile silos
[19:17] what's the bandwidth like in svalbard
[19:17] Doesn't xs4all in the Netherlands host a partial copy, and the Library of Alexandria another one?
[19:18] btw, does dr dobbs journal need to be scanned?
[19:18] I'd rather something more obscure be scanned.
[19:19] Something the world is more in danger of losing.
[19:19] hm... software manuals? :)
[19:19] Yes, things like that.
[19:19] the source code to ms-basic-80
[19:19] bitsavers is doing what they can
[19:20] iirc MIT or someone else has a printed copy of the ms-basic-80 src but Microsoft forbids them from duplicating it
[19:20] you're allowed to LOOK at it though, but that's it
[19:20] which version is basic-80? I know several people have the source to the 6502 BASIC
[19:20] and afaik at least one version is out there
[19:20] basic-80 is the 8080 Altair BASIC
[19:20] this is the source code used to assemble it
[19:22] what i really want to get a copy of is the Fortran or C source code and prosodic/morphemic data files from MITalk
[19:22] that's a majorly important part of speech synthesis history
[19:23] http://archive.org/details/thingiverse-20110829
[19:23] it was not the first software speech synthesizer engine, but it was the first modern one which actually used language parts etc. to decompose words, and not simple letter-to-sound rules like McIlroy's algorithm and the NRL algorithm
[19:23] We're due another one. Can someone do that?
[19:23] thingiverse has lost hundreds if not thousands of designs just today from people deleting them
[19:24] That's fine.
[19:24] We're due another one. Can someone do that?
[19:25] not enough space here :(
[19:26] Someone did an excellent download before
[19:26] And can do it again
[19:36] SketchCow: I think I have the thingiverse scripts here. (And they're also on Github, of course: https://github.com/ArchiveTeam/scrapy-thingy ) But is that still the best way to do it? It doesn't produce warcs.
[19:37] For this, yes.
[19:38] I just want it before it's all gone
[19:38] Before this assholery deletes all of it
[19:38] Which one do I want? i'll just run it.
[19:38] Got it, understand it now.
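(On the file-level "raid" described above: a rough sketch of what a bup-style consistency pass between a primary and a secondary copy might look like. The /ia6 and /ia7 paths are purely illustrative, taken from the node names in the log; the real archive.org tasks presumably use rsync itself and are far more involved.)

    import os

    def missing_on_secondary(primary_root, secondary_root):
        """Yield files under primary_root that are absent from, or a
        different size than, their counterpart under secondary_root."""
        for dirpath, _dirnames, filenames in os.walk(primary_root):
            rel = os.path.relpath(dirpath, primary_root)
            for name in filenames:
                src = os.path.join(dirpath, name)
                dst = os.path.join(secondary_root, rel, name)
                if (not os.path.exists(dst)
                        or os.path.getsize(dst) != os.path.getsize(src)):
                    yield src

    # Two copies of each file, one per datacenter: anything this prints
    # would need another rsync pass.
    for path in missing_on_secondary("/ia6/items", "/ia7/items"):
        print("needs re-sync:", path)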
[19:39] all_things calls thingy and all_users calls usery
[19:39] Yes, I think that was the idea.
[19:40] The way I wrote it, it downloads "recent additions", so I assume it can do incremental updates if you want.
[19:43] underscor: Are you downloading the Sony forums? (Just checking.)
[19:43] tef_: I'm dabbling with various python http-proxy things. haven't yet found one that works to my satisfaction.
[19:45] alard: as far as I know it's running
[19:45] I can't ssh from this access point (firewalled), but I'll get a status soon
[19:47] Ah, good.
[19:49] Downloading the thingies
[20:36] SketchCow: can you add this to wikiteam? http://archive.org/details/wikitweets
[20:36] unless there's a twitter collection which is more relevant
[21:02] 500 of 30,000 things now downloaded.
[22:39] chronomex: mitmproxy?
[22:42] hm, haven't looked into that
[22:43] I was looking at the simplest proxies I could, to make sure that nothing was getting modified or cached
[22:43] but then I wound up with a proxy that couldn't handle running a real browser through it
[22:43] like it would send everything to the first host it connected to
[22:48] chronomex: How about using ICAP: https://en.wikipedia.org/wiki/Internet_Content_Adaptation_Protocol
[22:48] http://icap-server.sourceforge.net/ seems to provide a client too.
[22:49] huh
[22:49] what exactly would this do for me?
[22:50] It is a protocol that allows a web proxy to hand off "content filtering, ad insertion" to an ICAP server
[22:50] But instead of content filtering, you could simply write all data you receive into a warc file and return the original data to the proxy
[22:50] hmm
[22:50] interesting
[22:51] Since ICAP is supported by multiple mature proxy implementations, it might be an alternative to finding a working proxy implementation in python
[22:51] do you know if headers are propagated to ICAP?
[22:52] To be honest, I have no idea how ICAP works in detail, but I thought it might work for you
[22:52] it's something to look into, to be sure
[22:54] sketchcow: high fucking five on the thingiverse grab
[22:54] i've been worrying about the state of 3d objects lately, that shit's gonna be regulated, DRM'd, and productized so fast
[22:54] soultcer: okay, it looks like the proxy munges HTTP headers into ICAP headers.
[22:55] The question is, does it allow you to match up requests with responses? Or does warc not contain the request headers?
[22:55] orrr, I was wrong - https://tools.ietf.org/html/rfc3507#page-18
[22:55] warc has request and response
[22:55] they're tied together with a uuid
[22:55] req has a uuid and a link to the uuid of the resp, and vice versa
[22:56] just use zless on a .warc.gz, it's human-readable
[22:56] Oh, I thought only arc files were human-readable
[22:56] nope :)
[23:21] Bre Pettis thanked us when we grabbed it last summer.
[23:27] soultcer, tef_: part of the reason I picked up this thread of development yesterday was my discovery that `wget --page-requisites` doesn't even fetch urls from
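(A minimal sketch of the request/response pairing described at [22:55]: each WARC record carries a WARC-Record-ID UUID, and WARC-Concurrent-To points at its partner record. Headers here are trimmed to the linking fields as an assumption-heavy illustration; a real record also needs WARC-Date, Content-Length, the HTTP payload, and so on.)

    import uuid

    def record_headers(warc_type, record_id, partner_id, target_uri):
        """Build the linking headers for one WARC record."""
        return "\r\n".join([
            "WARC/1.0",
            "WARC-Type: %s" % warc_type,
            "WARC-Record-ID: <urn:uuid:%s>" % record_id,
            "WARC-Concurrent-To: <urn:uuid:%s>" % partner_id,
            "WARC-Target-URI: %s" % target_uri,
        ])

    req_id, resp_id = uuid.uuid4(), uuid.uuid4()
    # The request points at the response's UUID, and vice versa, which
    # is also why zless on a .warc.gz is enough to eyeball the pairing.
    print(record_headers("request", req_id, resp_id, "http://example.com/"))
    print(record_headers("response", resp_id, req_id, "http://example.com/"))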