#archiveteam 2012-09-20,Thu

Time Nickname Message
00:22 🔗 dashcloud fellow archivers, what's the best way to label a DVD ?
00:24 🔗 chronomex write on paper envelope
00:33 🔗 SketchCow Don't make it
00:48 🔗 SketchCow People shouldn't be making DVDs at this late stage.
00:55 🔗 SketchCow Buy a couple drives and get a place to store offsite
01:06 🔗 chronomex ^ yup
01:14 🔗 godane i burn to bluray
01:14 🔗 godane but i also have my data still on a drive
01:14 🔗 godane this is only cause i'm poor
01:19 🔗 SketchCow You'll be poorer when you lose your crap
01:19 🔗 SketchCow Two 2gb drives: $200
01:20 🔗 underscor s/g/t/
01:21 🔗 SketchCow Shhhh
01:22 🔗 SketchCow Dude, I almost had the cash
02:11 🔗 dashcloud SketchCow: the reason I'm making DVDs is I've still got a considerable collection of commercial VHS tapes lying around, and I'm converting them to digital format slowly
02:40 🔗 SketchCow .AVI
02:41 🔗 SketchCow I love all you rascals but I'm not going to be convinced here.
02:41 🔗 Coderjoe .avi is like .tar for video.
02:42 🔗 Coderjoe so there is still the issue of codecs to use
02:42 🔗 Coderjoe (just throwing information in the ring.)
02:45 🔗 Coderjoe just don't use things like cinepak or indeo :D
02:52 🔗 underscor .mkv all the way
02:52 🔗 underscor .mkv and h264
02:55 🔗 DFJustin mp4 is more widely supported than mkv
03:00 🔗 underscor yeah but I like mkv as a container
03:04 🔗 dashcloud as long as you don't invent new fourccs for existing codecs or stick h264 into avi, we can be friends
03:09 🔗 chronomex then you'll hate being my roommate
06:49 🔗 Dan68 Any archived sites that'd be fun to mirror on a box not connected to the internet but accessible to a large group of people?
06:51 🔗 Dan68 I've got about 200gb of extra space to throw at it, maybe a few news sites like the montreal mirror or something similar
08:11 🔗 ersi Dan68: You.. uh.. what?
08:11 🔗 ersi I dunno, like.. Wikipedia? TVTropes? That might be fun for a bunch of people not connected to net I guess..
08:12 🔗 Dan68 eh
08:14 🔗 Dan68 Already am mirroring wikipedia (text only), if I had an extra hdd that was large enough I would grab a copy of the geocities archive
08:20 🔗 godane this has to be changed to text section: http://archive.org/details/groklaw.net-pdfs-2007-20120827
08:20 🔗 godane i put in video by mistake
11:41 🔗 tef_ chronomex: hrm?
11:42 🔗 tef_ chronomex: what are you writing the warc proxy in ?
12:17 🔗 tef_ chronomex, alard fwiw I've mirrored warctools on github https://github.com/tef/warctools
13:49 🔗 ersi tef_: cool! :)
13:59 🔗 tef_ unfortunately work has informed me I can't add a warc writing proxy to warctools, but paradoxically, I can accept push requests containing it
13:59 🔗 tef_ *head explodes*
13:59 🔗 tef_ but i have permission to add a different proxy, and http bits to it, and all the component parts
14:00 🔗 tef_ it's as if I am allowed to commit '2' but not '2+2' because 4 is our IP
15:03 🔗 SketchCow Converting the month to a number.
15:03 🔗 SketchCow Here's what I plan to do.
15:03 🔗 SketchCow I will add an item called cuamiga-magazine-104.
15:03 🔗 SketchCow In the collection named cuamiga-magazine...
15:03 🔗 SketchCow OK, then, CUAmiga_104_Oct_1998.pdf gets the love.
15:03 🔗 SketchCow I will say this dates to 1998-10.
15:03 🔗 SketchCow I will give it the title of CU Amiga Magazine Issue 104.
15:03 🔗 SketchCow The ingestor's back!
15:22 🔗 SmileyG s/[jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec]/[01,02,03,04,05,06,07,08,09,10,11,12]/g or something
15:25 🔗 SketchCow That's what it does, yes.
15:25 🔗 SketchCow xma to 12
15:25 🔗 SketchCow :)
15:26 🔗 * SmileyG turns the month upto 13
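[Editor's sketch] The month-to-number step SketchCow describes can be illustrated in Python. This is not the actual ingestor, whose code does not appear in this log. The filename pattern and the names `MONTHS`, `FILENAME_RE` and `describe` are assumptions made for illustration; the item name, collection, title and date follow the cuamiga-magazine example above.

```python
import re

# Three-letter English month abbreviations mapped to two-digit numbers.
MONTHS = {
    name: f"{number:02d}"
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Hypothetical pattern for names shaped like CUAmiga_104_Oct_1998.pdf.
FILENAME_RE = re.compile(
    r"^[A-Za-z0-9]+_(?P<issue>\d+)_(?P<month>[A-Za-z]{3})_(?P<year>\d{4})\.pdf$"
)


def describe(filename, collection, title_prefix):
    """Derive item metadata from a magazine scan's filename."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised filename: {filename}")
    month = MONTHS[match["month"].lower()]  # KeyError on an unknown month
    return {
        "identifier": f"{collection}-{match['issue']}",
        "collection": collection,
        "title": f"{title_prefix} Issue {match['issue']}",
        "date": f"{match['year']}-{month}",
    }
```

A dictionary lookup is used rather than a regex character class because a class such as `[jan,feb,...]` matches single characters, not whole month names.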
15:26 🔗 SketchCow Looks like roughly 300 magazine issues from 5 titles are going in.
15:36 🔗 SketchCow 330
15:36 🔗 SketchCow Excellent.
15:36 🔗 SketchCow And 8 titles
15:39 🔗 SmileyG Warning. Mr Burns type monolog detected ;)
15:39 🔗 underscor :D
16:20 🔗 SketchCow http://archive.org/details/page6-magazine
18:59 🔗 SketchCow About to shove 33gb, 128 issues of BYTE into archive.org.
19:00 🔗 SketchCow I've already flooded/broken the incoming stream, the thing that Stops Jason From Adding So Much Shit kicked in and blocked me
19:00 🔗 SketchCow So let's triple it!!!!!!
19:06 🔗 Lord_Nigh SketchCow: btw how is the archive.org raid set up? is it raid-5 or raid-6?
19:06 🔗 DFJustin LIMITER DISENGAGE
19:07 🔗 Lord_Nigh (i assume the latter given this is archival)
19:10 🔗 SketchCow I am not qualified to discuss how the archive.org systems are set up.
19:11 🔗 SketchCow DFJustin: The limiter only applies to me and only engages when I and others cause a 200+ document backup or something
19:13 🔗 Lord_Nigh maybe it's raid-12: a mirrored array of raid-6 drives
19:13 🔗 brayden How many disks can fail before you lose data?
19:13 🔗 Lord_Nigh raid-12? as long as you have no common disks fail in both arrays, two in each array
19:14 🔗 Lord_Nigh raid5 can survive one disk failing, raid6 can survive two
19:14 🔗 underscor We do not have raid.
19:14 🔗 brayden THinking maybe SketchCow got told this once and can at least disclose that.
19:14 🔗 brayden wat.
19:14 🔗 underscor There are two copies of each file, in separate datacenters
19:14 🔗 brayden fair enough
19:14 🔗 brayden some sort of rsync like thing going on I guess?
19:14 🔗 Lord_Nigh ah so its logical raid/rsync
19:14 🔗 Lord_Nigh file-level "raid"
19:15 🔗 Lord_Nigh like those hp NASes use
19:15 🔗 underscor Yeah.
19:15 🔗 underscor That's what "bup" tasks are
19:15 🔗 underscor Rsync from primary (ia6) to secondary (ia7)
19:16 🔗 Lord_Nigh you guys need a 3rd datacenter drilled into granite bedrock in sweden or something
19:16 🔗 Lord_Nigh or maybe in one of those nuke-proof missile silos
19:17 🔗 DopefishJ what's the bandwidth like in svalbard
19:17 🔗 soultcer Doesn't xs4all in the Netherlands host a partial copy, and the Library of Alexandria another one?
19:18 🔗 balrog_ btw, does dr dobbs journal need to be scanned?
19:18 🔗 SketchCow I'd rather something more obscure be scanned.
19:19 🔗 SketchCow Something the world is more in danger of losing.
19:19 🔗 balrog_ hm... software manuals? :)
19:19 🔗 SketchCow Yes, things like that.
19:19 🔗 Lord_Nigh the source code to ms-basic-80
19:19 🔗 SketchCow bitsavers is doing what they can
19:20 🔗 Lord_Nigh iirc mit or someone else has a printed copy of the ms-basic-80 src but microsoft forbids them from duplicating it
19:20 🔗 Lord_Nigh you're allowed to LOOK at it though, but that's it
19:20 🔗 balrog_ which version is basic-80? I know several people have the source to the 6502 BASIC
19:20 🔗 balrog_ and afaik at least one version is out there
19:20 🔗 Lord_Nigh basic-80 is the 8080 altair basic
19:20 🔗 Lord_Nigh this is the source code used to assemble it
19:22 🔗 Lord_Nigh what i really want to get a copy of is the fortran or C source code and prosodic/morphemic data files from mitalk
19:22 🔗 Lord_Nigh that's a majorly important part of speech synthesis history
19:23 🔗 SketchCow http://archive.org/details/thingiverse-20110829
19:23 🔗 Lord_Nigh it was not the first software speech synthesizer engine but it was the first modern one which actually used language parts etc to decompose words and not simple letter to sound rules like mcilroy's algorithm and the NRL algorithm
19:23 🔗 SketchCow We're due another one. Can someone do that?
19:23 🔗 Lord_Nigh thingiverse has lost hundreds if not thousands of designs just today from people deleting them
19:24 🔗 SketchCow That's fine.
19:24 🔗 SketchCow We're due another one. Can someone do that?
19:25 🔗 Lord_Nigh not enough space here :(
19:26 🔗 SketchCow Someone did an excellent download before
19:26 🔗 SketchCow And can do it again
19:36 🔗 alard SketchCow: I think I have the thingiverse scripts here. (And they're also on Github, of course: https://github.com/ArchiveTeam/scrapy-thingy ) But is that still the best way to do it? It doesn't produce warcs.
19:37 🔗 SketchCow For this, yes.
19:38 🔗 SketchCow I just want it before it's all gone
19:38 🔗 SketchCow Before this assholery deletes all of it
19:38 🔗 SketchCow Which one do I want? i'll just run it.
19:38 🔗 SketchCow Got it, understand it now.
19:39 🔗 SketchCow all_things calls thingy and all_users calls usery
19:39 🔗 alard Yes, I think that was the idea.
19:40 🔗 alard I wrote it downloads "recent additions", so I assume it can do incremental updates if you want.
19:43 🔗 alard underscor: Are you downloading the sony forums? (Just checking.)
19:43 🔗 chronomex tef_: I'm dabbling with various python http-proxy things. haven't yet found one that works to my satisfaction.
19:45 🔗 underscor alard: as far as I know it's running
19:45 🔗 underscor I can't ssh from this access point (firewalled), but I'll get a status soon
19:47 🔗 alard Ah, good.
19:49 🔗 SketchCow Downloading the thingies
20:36 🔗 Nemo_bis SketchCow: can you add to wikiteam? http://archive.org/details/wikitweets
20:36 🔗 Nemo_bis unless there's a twitter collection which is more relevant
21:02 🔗 SketchCow 500 of 30,000 things now downloaded.
22:39 🔗 tef_ chronomex: mitmproxy ?
22:42 🔗 chronomex hm, haven't looked into that
22:43 🔗 chronomex I was looking at the simplest proxies I could, to make sure that nothing was getting modified or cached
22:43 🔗 chronomex but then I wound up with a proxy that couldn't handle running a real browser through it
22:43 🔗 chronomex like it would send everything to the first host it connected to
22:48 🔗 soultcer chronomex: How about using ICAP: https://en.wikipedia.org/wiki/Internet_Content_Adaptation_Protocol
22:48 🔗 soultcer http://icap-server.sourceforge.net/ seems to provide a client too.
22:49 🔗 chronomex huh
22:49 🔗 chronomex what exactly would this do for me?
22:50 🔗 soultcer It is a protocol that allows a web proxy to hand off "content filtering, ad insertion" to an icap server
22:50 🔗 soultcer But instead of content filtering, you could simply write all data you receive into a warc file and return the original data to the proxy
22:50 🔗 chronomex hmm
22:50 🔗 chronomex interesting
22:51 🔗 soultcer Since icap is supported by multiple mature proxy implementations, it might be an alternative to finding a working proxy implementation in python
22:51 🔗 chronomex do you know if headers are propagated to icap?
22:52 🔗 soultcer To be honest, I have no idea how ICAP works in detail, but I thought it might work for you
22:52 🔗 chronomex it's something to look into, to be sure
22:54 🔗 DrainLbry sketchcow: high fucking five on the thingiverse grab
22:54 🔗 DrainLbry i've been worrying about the state of 3d objects lately, that shits gonna be regulated, DRM'd, and productized so fast
22:54 🔗 chronomex soultcer: okay, it looks like the proxy munges HTTP headers into ICAP headers.
22:55 🔗 soultcer The question is, does it allow you to match up requests with responses? Or does warc not contain the request headers?
22:55 🔗 chronomex orrr, I was wrong - https://tools.ietf.org/html/rfc3507#page-18
22:55 🔗 chronomex warc has request and response
22:55 🔗 chronomex they're tied together with a uuid
22:55 🔗 chronomex req has a uuid and a link to the uuid of the resp, and vice versa
22:56 🔗 chronomex just use zless on a .warc.gz, it's human-readable
22:56 🔗 soultcer Oh, I thought only arc files were human-readable
22:56 🔗 chronomex nope :)
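[Editor's sketch] The request/response linkage chronomex describes can be seen by reading only the WARC headers of each record. The Python below is a minimal sketch, not warctools or any other real library. `make_record`, `read_records` and `SAMPLE` are made-up names, and the sample records are deliberately stripped down: real WARC records carry more headers, such as WARC-Date and Content-Type. `WARC-Record-ID` and `WARC-Concurrent-To` are the header names defined by the WARC format. The sample links the two records in both directions, as described above.

```python
import gzip
import io
import uuid


def make_record(rec_type, record_id, concurrent_to, payload):
    """Build one simplified WARC record (illustration only)."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {rec_type}",
        f"WARC-Record-ID: <urn:uuid:{record_id}>",
        f"WARC-Concurrent-To: <urn:uuid:{concurrent_to}>",
        f"Content-Length: {len(payload)}",
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8") + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield the WARC header dict of each record in a decompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            raw = stream.readline()
            if raw in (b"\r\n", b"\n", b""):
                break  # end of the header block
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        stream.read(int(headers["Content-Length"]))  # skip the payload
        yield headers


# Made-up sample: one request and one response that point at each other.
REQ_ID, RESP_ID = uuid.uuid4(), uuid.uuid4()
SAMPLE = gzip.compress(
    make_record("request", REQ_ID, RESP_ID,
                b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
) + gzip.compress(
    make_record("response", RESP_ID, REQ_ID,
                b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
)

if __name__ == "__main__":
    with gzip.GzipFile(fileobj=io.BytesIO(SAMPLE)) as stream:
        for record in read_records(stream):
            print(record["WARC-Type"], record["WARC-Record-ID"],
                  "->", record["WARC-Concurrent-To"])
```

Each record is gzipped separately and the members are concatenated, the usual convention for .warc.gz files. That is also why zless can page through such a file as plain text.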
23:21 🔗 SketchCow Bre Pettis thanked us when we grabbed it last summer.
23:27 🔗 chronomex soultcer, tef_: part of the reason I picked up this thread of development yesterday was my discovery that `wget --page-requisites` doesn't even fetch urls from <script src=
23:27 🔗 chronomex which kind of seems wrong ...
23:28 🔗 soultcer Yeah, I wrote my own "mirror wget" in ruby after I discovered that, though I haven't updated it to use warc files
23:29 🔗 chronomex my final solution will consist of some kind of headless browser that does js, and a warc-writing proxy.
23:30 🔗 soultcer You could write a firefox plugin that writes warcs :-)
23:30 🔗 chronomex oh dear
23:30 🔗 chronomex I don't even use firefox!
23:31 🔗 soultcer Haha Internet Explorer?
23:32 🔗 chronomex opera for most things, chrome for things that don't work in opera and for flash
23:36 🔗 DFJustin submit a patch for wget so it does include scripts?
23:36 🔗 chronomex but do you expect it to run scripts and get things that the script asks for?
23:37 🔗 chronomex I think they drew a line and decided that they were fine with not fetching any js
23:37 🔗 DFJustin I don't but it seems silly to leave something obvious like that on the floor
23:37 🔗 * chronomex shrugs
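[Editor's sketch] On the page-requisites thread: a collector that does include `<script src=...>` could look like the Python below, using only the standard library. This is not wget's code and not soultcer's Ruby mirror tool. `RequisiteCollector` and `page_requisites` are made-up names. It only lists URLs found in static HTML and does not run any JavaScript, so it does not cover the headless-browser case chronomex mentions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RequisiteCollector(HTMLParser):
    """Collect URLs of scripts, images and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            # External scripts and images both use the src attribute.
            self.urls.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # Only stylesheet links count as requisites in this sketch.
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.urls.append(urljoin(self.base_url, attrs["href"]))


def page_requisites(html, base_url):
    """Return absolute URLs of the page's static requisites, in document order."""
    collector = RequisiteCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls
```

Inline scripts without a `src` attribute are skipped, since there is nothing to fetch for them.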
