[00:13] <Coderjoe> trying to find some video you saw over a year ago on youtube is like ...
[00:13] <Coderjoe> trying to find a book in the remains of a library explosion?
[00:13] <Coderjoe> (a specific book, of course)
[00:15] <yipdw> trying to find some video you saw over a year ago on YouTube is like THIS VIDEO HAS BEEN TAKEN DOWN DUE TO A COPYRIGHT CLAIM FROM UNITED MEDIA CONGLOMERATE
[00:18] <Coderjoe> well, except the video I am trying to find was entirely original content
[00:19] <Ymgve> doesn't stop someone from DMCAing it on a hunch
[02:48] <underscor> Coderjoe: Hahaha, yeah
[02:49] <underscor> Well, with something on the order of 36 hours of content uploaded a minute
[02:49] <chronomex> about 25 hours of content dmca'd a minute
[02:49] <underscor> hahaa
[02:50] <yipdw> tef: the decompression feature you added to warc2warc seems to work fine
[02:50] <yipdw> tef: I'm uploading a warc2warc'd warc to a Wayback instance to check the result
[02:50] <Paradoks> If only we could watch all the new Youtube content Back to the Future II style.
[02:51] <yipdw> damnit
[02:51] <yipdw> I tripped over my DNS server
[02:51] <chronomex> lol
[02:52] <yipdw> or more specifically its serial cable
[03:28] <yipdw> oh, hmm
[03:28] <yipdw> tef: I think warc2warc may be losing data
[03:28] <dashcloud> anyone used this filesystem before? http://www.lessfs.com/ it describes itself as: A high performance inline data deduplicating filesystem for Linux.
[03:29] <yipdw> tef: is there a place that I can send you examples? or would you prefer me to upload example WARCs somewhere?
[03:30] <tef> thomas.figg@hanzoarchives.com
[03:30] <yipdw> ah, ok
[03:30] <tef> it *shouldn't be losing data* obviously
[03:30] <yipdw> right
[03:30] <tef> I will look at it tomorrow morning, ish. Cos it's almost 4 am
[03:30] <yipdw> I noticed, however, that one WARC record was truncated to zero bytes
[03:31] <yipdw> sure, no problem
[03:31] <tef> hmm, that would suggest something failed to decode
[03:31] <yipdw> yeah, possibly.  there's not much more that can be done until I send you examples though :)
[03:31] <yipdw> so yeah, I'll write that up etc
[03:37] <tef> obv if it is easier to upload them and link them, I am all for that too
[03:38] <tef> I'm quite happy to work from 'this warc does not work as expected'
[03:39] <yipdw> sure
[03:39] <yipdw> actually, would you prefer that I file an issue in Bitbucket?
[03:39] <yipdw> none of this is really private
[03:41] <tef> i'm just happy with test data :3
[03:41] <tef> it's more than I get a lot of the time...
[04:01] <tef> yipdw: that's perfect, thanks. I will get back to you tomorrow
[06:40] <savetz> mobileme-grab quit on me with "error downloading 'sboyack'" - do I need to do anything special to make sure that gets grabbed again?
[06:40] <savetz> the reason is apparently because my hard drive filled up
[07:21] <yipdw> savetz: you can either run dld-single.sh to explicitly redownload it, or just let it linger in the queue -- it'll be requeued at some point
[12:03] <Ymgve> SketchCow: seen this? http://www.metafilter.com/111701/Putting-kickstarter-out-of-business
[12:03] <Ymgve> Might be of interest since you've used Kickstarter
[12:08] <ersi> So.. they should be put out of business because they make money?
[12:08] <ersi> They offer an umbrella, a brand and a collected spot with people that like to fling their wallets open
[12:09] <Ymgve> 5% sounds a bit much, though
[12:10] <ersi> I'm not able to judge that, but I believe they take a lot of hassle out of asking people for money to do a project
[12:11] <Ymgve> true
[12:12] <ersi> It sounds like ramblings from an outsider that hasn't tried A) raising money B) tried out kickstarter
[14:48] <tef> yipdw: found the bug
[14:48] <tef> well technically it is in wget-warc :/ it is producing warc records, with http in them, that *claim* to be chunked, and aren't
[14:51] <tef> I can put in a workaround so that broken http messages are left untouched. I'm not sure what is the right thing to do when you get a Transfer-Encoding: chunked with no chunks
[14:56] <tef> fwiw - if you put the chunks back in, -D will clean them
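A check for the case tef describes (a body that claims to be chunked but is not) might look like the sketch below. It assumes the HTTP body has already been separated from its headers, and it ignores chunk extensions and trailers. `decode_chunked` and `is_chunked` are hypothetical names for illustration, not functions from warc2warc or wget.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body; raise ValueError if it is not chunked."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            raise ValueError("missing chunk-size line")
        # Anything after ';' is a chunk extension, which this sketch ignores.
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        if not size_field or any(c not in HEX_DIGITS for c in size_field):
            raise ValueError("chunk size is not hexadecimal")
        size = int(size_field, 16)
        pos = eol + 2
        if size == 0:
            # Last chunk reached; any trailers are ignored here.
            return bytes(out)
        chunk = body[pos:pos + size]
        if len(chunk) != size or body[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("truncated or malformed chunk")
        out += chunk
        pos += size + 2


def is_chunked(body: bytes) -> bool:
    """Return True only if the whole body parses as chunked encoding."""
    try:
        decode_chunked(body)
    except ValueError:
        return False
    return True
```

A record whose headers say `Transfer-Encoding: chunked` but whose body fails `is_chunked` would be one of the broken messages that could be left untouched.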
[17:13] <savetz> the MobileMe grabber has been crashing on me in the last 24 hours. is apple being weird? last time= ERROR (3)
[17:28] <yipdw> tef: got your message; that's some weird wget-warc behavior
[17:29] <yipdw> I'll check what's actually coming back from the server
[17:32] <Coderjoe_> yipdw: I think I know why wget-warc is doing it. I think it snags the result after wget has handled transfer-encoding, rather than the raw network stream.
[17:33] <Coderjoe_> ISTR it hooks a temporary file or something into the output code
[17:34] <yipdw> oh, so the response really is chunked but the warc code isn't able to write the chunks?
[17:34] <yipdw> hm
[17:34] <yipdw> one sec, I'm gonna see what curl does for some of these URIs
[17:34] <yipdw> I haven't actually examined what comes back from these servers too closely
[17:36] <Coderjoe> I'm fairly sure that is correct
[17:37] <yipdw> well, yeah, the response from ff.net is definitely marked as chunked
[17:38] <yipdw> hmm
[17:39] <yipdw> I wonder if there's a way to get precisely what wget gets
[17:40] <Coderjoe> yeah. move the hook into the network handling code, before the transfer-encoding handler (or perhaps IN the transfer-encoding handler?)
[17:41] <yipdw> from the way the WARC code is structured, that doesn't look like it's a trivial job
[17:41] <yipdw> or rather it looks like it was designed to sit at arm's length from the rest of wget
[17:42] <yipdw> but I dunno, I've only skimmed over said code
[17:42] <yipdw> alard would know best
[17:46] <yipdw> damn
[17:46] <yipdw> just been reading the gethttp function in wget's http.c
[17:47] <yipdw> that could really use a state machine formalization
[17:47] <yipdw> IMO, anyway, but I've been told that I'm weird
[17:56] <yipdw> on a lighter note
[17:56] <yipdw> http://tctechcrunch2011.files.wordpress.com/2011/11/screen-shot-2011-11-20-at-8-44-25-pm1.png?w=620
[18:35] <tef> warcs are meant to be *raw* traffic somewhat
[18:36] <tef> well really - delete any transfer-encoding headers and add a content-length
[18:36] <tef> or put the raw traffic in
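The first option tef mentions (drop the Transfer-Encoding header and add a Content-Length) could be sketched roughly as follows. This assumes the stored body is already de-chunked, as it appears to be in the affected wget-warc records, and that headers are not folded across lines. `fix_headers` is a hypothetical name, not an existing warc-tools function.

```python
def fix_headers(message: bytes) -> bytes:
    """Replace Transfer-Encoding with a Content-Length matching the stored body."""
    head, sep, body = message.partition(b"\r\n\r\n")
    if not sep:
        # No header/body boundary found: leave the message untouched.
        return message
    lines = head.split(b"\r\n")
    status, headers = lines[0], lines[1:]
    # Header names are case-insensitive, so compare in lowercase.
    names = [h.split(b":", 1)[0].strip().lower() for h in headers]
    if b"transfer-encoding" not in names:
        return message
    drop = (b"transfer-encoding", b"content-length")
    kept = [h for h, n in zip(headers, names) if n not in drop]
    kept.append(b"Content-Length: " + str(len(body)).encode("ascii"))
    return b"\r\n".join([status] + kept) + b"\r\n\r\n" + body
```

Because this changes the length of the HTTP message, the enclosing WARC record's own Content-Length would need to be recomputed as well.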
[18:36] <tef> hmm
[18:37] <savetz> my mobile-me downloader is crashing constantly. is it me? ERROR (3)s all the time
[18:45] <SketchCow> Ymgve: That guy is an idiot.
[18:45] <SketchCow> Kickstarter does SO MUCH MORE for the 5%
[18:57] <balrog> hey SketchCow
[18:57] <balrog> off the top of your head, what would be useful metadata fields for a floppy archival format?
[18:58] <tef> yipdw: I think it would be easier to strip out the headers that don't apply - like transfer-encoding
[19:08] <Coderjoe> one post: Short version: "I don't like something and don't want to use it, so therefore it should be made to go away." Oh FFS...
[19:43] <SketchCow> Oh, god.
[19:44] <SketchCow> OK, in short form: There is SO MUCH FUCKING WORK being done on that, the website would REALLY be improved with links to all that.
[19:44] <SketchCow> So when I get back from the interview I'm doing, let me find that.
[19:44] <SketchCow> I want to do some actual archive team business today, so look forward to that.
[19:51] <ersi> Booya!
[20:19] <yipdw> tef: I'll run that by alard, he's more familiar with wget's WARC code than me
[20:19] <yipdw> if there ends up being an easy way to hook into what's coming over the wire that'd be worth trying too
[20:20] <tef> yes
[20:23] <yipdw> this'll be fun, though; I know that people in this channel have generated several terabytes of WARCs that may have invalid records :P
[20:23] <yipdw> so going forward we'll need some way to fix that -- maybe stripping out Transfer-Encoding would be easiest
[20:24] <yipdw> guess I'll poke around the warc-tools source and see how hard it'd be to write a tool like that
[20:24] <yipdw> and I guess this would be an excuse to break out the Hadoop or something
[20:40] <chronomex> fix-brokenass-shit.sh
[20:55] <yipdw> chronomex: nah
[20:55] <yipdw> fix-brokenass-shit.rb if I'm writing it, obviously
[20:55] <yipdw> or maybe it can be FixBrokenassShit.hs and then everyone will be angry
[20:59] <underscor> lol
[21:10] <chronomex> Haskell? Excellent.
[21:11] <chronomex> you can make more people mad by writing it in Erlang.
[21:11] <chronomex> YOU DON'T NEED CONCURRENCY FOR THAT, WTF MAN
[21:15] <yipdw> yes
[21:15] <yipdw> one Erlang process per WARC
[21:16] <yipdw> actually, that would probably scale pretty well
[21:30] <tef> yipdw: I can hack warc2warc to do it
[21:32] <tef> it's literally putting a content = content.replace("Transfer-Encoding: chunked\r\n","") or similar iirc
[21:32] <tef> but i'd rather not put a wget-specific fix in the trunk, though it is easy to fork
[21:33] <tef> I guess I can add it as an option 'sniff: chunking'
[21:33] <tef> or -W for --wget-workaround
[21:33] <tef> or something
[21:35] <yipdw> tef: I was just going to write something for use by people here
[21:35] <yipdw> as I think it's got the largest concentration of people who have used wget's WARC writer
[21:35] <yipdw> or, really, just get something that SketchCow (or whoever) can run on a large batch of WARCs
[21:35] <yipdw> and concurrently fix up wget's WARC writer to do The Right Thing
[21:36] <yipdw> but, yeah, a fork works too
[21:43] <closure> http://blog.archive.org/2012/01/17/12-hours-dark-internet-archive-vs-censorship/
[21:44] <yipdw> I didn't know IA was blacklisted in China
[21:44] <yipdw> though I shouldn't be surprised
[21:58] <chronomex> ditto
[22:02] <yipdw> http://abcnews.go.com/Technology/wikipedia-blackout-websites-wikipedia-reddit-dark-wednesday-protest/story?id=15373251#.TxXwA2OXRN0
[22:02] <yipdw> what crappy lead graphics
[22:03] <yipdw> what sort of two-bit hack graphic artists work for the major news media these days?
[22:18] <yipdw> ha, and it's awesome seeing what Rupert Murdoch is tweeting
[22:18] <yipdw> it seems that Google, a software company, is out to destroy "software creators"
[22:18] <tef> yipdw: i'll just make a wget-warc-clean.py or something
[22:18] <tef> i tend to avoid names like unfuck
[22:18] <tef> :3
[22:18] <yipdw> that man is like Gene Ray, Cubic, but with money
[22:18] <yipdw> tef: heh, good thing
[22:25] <NovaKing> http://torrentfreak.com/mpaa-internet-blackout-is-a-pr-stunt-users-are-corporate-pawns-120117/
[22:27] <yipdw> well, yeah, it is a PR stunt
[22:27] <yipdw> good to know they can state the obvious
[22:27] <Coderjoe> bleh. fix-broken-shit.py
[22:27] <tef> work repo
[22:28] <yipdw> wait
[22:28] <yipdw> "The following is a statement by Senator Chris Dodd, Chairman and CEO of the Motion Picture Association of America, Inc. (MPAA)"
[22:28] <yipdw> I didn't know corruption had gone tha far!
[22:28] <yipdw> +t
[22:29] <yipdw> man, talk about cutting out the middleman
[22:29] <yipdw> why buy off legislature when you can just install someone to do your work
[22:30] <chronomex> because legislature is cheap
[22:30] <chronomex> $5000ish
[22:30] <Coderjoe> (note, when I said that, I had just read the ruby/haskell/erlang part of my log)
[22:30] <yipdw> chronomex: so are some Indian software houses, but that doesn't mean you'll get what you want
[22:31] <chronomex> yipdw: like legislature!
[22:31] <Coderjoe> it freaks me out when I call the bank credit card number and get someone in india.
[22:33] <yipdw> I think that means you need a new bank
[22:34] <Coderjoe> yeah... I have to wonder if that's actually how that card got compromised in the first place
[22:36] <chronomex> lol
[22:36] <Coderjoe> i've been working on leaving them.
[22:36] <Coderjoe> but how do I know if a bank I am looking at has outsourced their calls?
[22:37] <Coderjoe> and to where
[22:38] <Coderjoe> (this is also one of the largest banks in the world we're talking about)
[22:40] <closure> Coderjoe: didn't know you were into haskell
[22:42] <Coderjoe> I'm curious where I gave the impression I was
[22:42] <closure> "ruby/haskell/erlang" above
[22:43] <Coderjoe> closure: yipdw is the one that mention haskell. I was reading my channel scrollback
[22:43] <yipdw> Coderjoe eats GADTs for breakfast
[22:43] <Coderjoe> s/(mention)/\1ed/
[22:45] <closure> >>= and <*> would be fun breakfast cereal indeed
[22:47] <yipdw> cherri (.) s
[23:18] <SketchCow> Downloading the Sound of Young America - we'll have that going up today.
[23:18] <SketchCow> Awww yeah