[00:13] trying to find some video you saw over a year ago on youtube is like ...
[00:13] trying to find a book in the remains of a library explosion?
[00:13] (a specific book, of course)
[00:15] trying to find some video you saw over a year ago on YouTube is like THIS VIDEO HAS BEEN TAKEN DOWN DUE TO A COPYRIGHT CLAIM FROM UNITED MEDIA CONGLOMERATE
[00:18] well, except the video I am trying to find was entirely original content
[00:19] doesn't stop someone from DMCAing it on a hunch
[02:48] Coderjoe: Hahaha, yeah
[02:49] Well, with something on the order of 36 hours of content uploaded a minute
[02:49] about 25 hours of content dmca'd a minute
[02:49] hahaa
[02:50] tef: the decompression feature you added to warc2warc seems to work fine
[02:50] tef: I'm uploading a warc2warc'd warc to a Wayback instance to check the result
[02:50] If only we could watch all the new Youtube content Back to the Future II style.
[02:51] damnit
[02:51] I tripped over my DNS server
[02:51] lol
[02:52] or more specifically its serial cable
[03:28] oh, hmm
[03:28] tef: I think warc2warc may be losing data
[03:28] anyone used this filesystem before? http://www.lessfs.com/ it describes itself as: A high performance inline data deduplicating filesystem for Linux.
[03:29] tef: is there a place that I can send you examples? or would you prefer me to upload example WARCs somewhere?
[03:30] thomas.figg@hanzoarchives.com
[03:30] ah, ok
[03:30] it *shouldn't be losing data* obviously
[03:30] right
[03:30] I will look at it tomorrow morning, ish. Cos it's almost 4 am
[03:30] I noticed, however, that one WARC record was truncated to zero bytes
[03:31] sure, no problem
[03:31] hmm, that would suggest something failed to decode
[03:31] yeah, possibly. there's not much more that can be done until I send you examples though :)
[03:31] so yeah, I'll write that up etc
[03:37] obv if it is easier to upload them and link them, I am all for that too
[03:38] I'm quite happy to work from 'this warc does not work as expected'
[03:39] sure
[03:39] actually, would you prefer that I file an issue in Bitbucket?
[03:39] none of this is really private
[03:41] i'm just happy with test data :3
[03:41] it's more than I get a lot of the time...
[04:01] yipdw: that's perfect, thanks. I will get back to you tomorrow
[06:40] mobileme-grab quit on me with "error downloading 'sboyack'" - do I need to do anything special to make sure that gets grabbed again?
[06:40] the reason is apparently because my hard drive filled up
[07:21] savetz: you can either run dld-single.sh to explicitly redownload it, or just let it linger in the queue -- it'll be requeued at some point
[12:03] SketchCow: seen this? http://www.metafilter.com/111701/Putting-kickstarter-out-of-business
[12:03] Might be of interest since you've used Kickstarter
[12:08] So.. they should be put out of business because they make money?
[12:08] They offer an umbrella, a brand and a collected spot with people that like to fling their wallets open
[12:09] 5% sounds a bit much, though
[12:10] I'm not able to judge that, but I believe they take a lot of hassle out of asking people for money to do a project
[12:11] true
[12:12] It sounds like ramblings from an outsider that hasn't tried A) raising money B) tried out kickstarter
[14:48] yipdw: found the bug
[14:48] well technically it is in warc-wget :/ it is producing warc records, with http in them, that *claim* to be chunked, and aren't
[14:51] I can put a workaround in so that broken http messages are left untouched. I'm not sure what is the right thing to do when you get a Transfer-Encoding: chunked with no chunks
[14:56] fwiw - if you put the chunks back in, -D will clean them
[17:13] the MobileMe grabber has been crashing on me in the last 24 hours. is apple being weird? last time= ERROR (3)
[17:28] tef: got your message; that's some weird wget-warc behavior
[17:29] I'll check what's actually coming back from the server
[17:32] yipdw: I think I know why wget-warc is doing it. I think it snags the result after wget has handled transfer-encoding, rather than the raw network stream.
[17:33] ISTR it hooks a temporary file or something into the output code
[17:34] oh, so the response really is chunked but the warc code isn't able to write the chunks?
[17:34] hm
[17:34] one sec, I'm gonna see what curl does for some of these URIs
[17:34] I haven't actually examined what comes back from these servers too closely
[17:36] I'm fairly sure that is correct
[17:37] well, yeah, the response from ff.net is definitely marked as chunked
[17:38] hmm
[17:39] I wonder if there's a way to get precisely what wget gets
[17:40] yeah. move the hook into the network handling code, before the transfer-encoding handler (or perhaps IN the transfer-encoding handler?)
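The mismatch described at [14:48] (headers that claim chunked encoding over a body that has already been de-chunked) can be checked for by trying to parse the body as chunks. What follows is only a rough sketch of that idea, not code from warc-tools or wget: `looks_chunked` is a hypothetical name, and trailer headers after the final zero-size chunk are deliberately ignored.

```python
import re

# A chunk-size line is a run of hex digits, optionally followed by ";extensions".
_HEX = re.compile(rb"[0-9A-Fa-f]+")


def looks_chunked(body: bytes) -> bool:
    """Return True if `body` parses as HTTP/1.1 chunked transfer coding.

    Hypothetical helper for illustration. It stops at the terminating
    zero-size chunk and does not validate any trailer section.
    """
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            return False  # no size line: ran out of data before a final chunk
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        if not _HEX.fullmatch(size_field):
            return False  # size line is not hexadecimal, so not a chunk header
        size = int(size_field, 16)
        if size == 0:
            return True  # terminating chunk reached
        end = eol + 2 + size
        # Each chunk's data must be followed by CRLF; a short body fails here too.
        if body[end:end + 2] != b"\r\n":
            return False
        pos = end + 2
```

A record whose HTTP headers say `Transfer-Encoding: chunked` but whose body makes `looks_chunked` return False would be a candidate for the "left untouched" workaround mentioned at [14:51].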
[17:41] from the way the WARC code is structured, that doesn't look like it's a trivial job
[17:41] or rather it looks like it was designed to sit at arm's length from the rest of wget
[17:42] but I dunno, I've only skimmed over said code
[17:42] alard would know best
[17:46] damn
[17:46] just been reading the gethttp function in wget's http.c
[17:47] that could really use a state machine formalization
[17:47] IMO, anyway, but I've been told that I'm weird
[17:56] on a lighter note
[17:56] http://tctechcrunch2011.files.wordpress.com/2011/11/screen-shot-2011-11-20-at-8-44-25-pm1.png?w=620
[18:35] warcs are meant to be *raw* traffic somewhat
[18:36] well really - delete any transfer-encoding headers and add a content-length
[18:36] or put the raw traffic in
[18:36] hmm
[18:37] my mobile-me downloader is crashing constantly. is it me? ERROR (3)s all the time
[18:45] Ymgve: That guy is an idiot.
[18:45] Kickstarter does SO MUCH MORE for the 5%
[18:57] hey SketchCow
[18:57] off the top of your head, what would be useful metadata fields for a floppy archival format?
[18:58] yipdw: I think it would be easier to strip out the headers that don't apply - like transfer-encoding
[19:08] one post: Short version: "I don't like something and don't want to use it, so therefore it should be made to go away." Oh FFS...
[19:43] Oh, god.
[19:44] OK, in short form: There is SO MUCH FUCKING WORK being done on that, the website would REALLY be improved with links to all that.
[19:44] So when I get back from the interview I'm doing, let me find that.
[19:44] I want to do some actually archive team business today, so look forward to that.
[19:51] Booya!
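The fix suggested at [18:36] ("delete any transfer-encoding headers and add a content-length") might look roughly like the sketch below. It assumes the body has already been de-chunked, as in the wget-warc case discussed here, and `strip_transfer_encoding` is a hypothetical name rather than anything in warc-tools. Obsolete folded (multi-line) headers are not handled.

```python
def strip_transfer_encoding(message: bytes) -> bytes:
    """Drop Transfer-Encoding and set Content-Length on a raw HTTP message.

    Hypothetical helper for illustration; assumes the body after the blank
    line is already de-chunked, so its length is the real payload length.
    """
    head, sep, body = message.partition(b"\r\n\r\n")
    if not sep:
        return message  # no header/body boundary found: leave untouched

    lines = head.split(b"\r\n")
    status_line, header_lines = lines[0], lines[1:]

    # Remove the misleading header, plus any stale Content-Length.
    dropped = (b"transfer-encoding", b"content-length")
    kept = [
        line for line in header_lines
        if line.split(b":", 1)[0].strip().lower() not in dropped
    ]
    kept.append(b"Content-Length: " + str(len(body)).encode("ascii"))

    return b"\r\n".join([status_line] + kept) + b"\r\n\r\n" + body
```

Because the rewritten HTTP message changes size, the enclosing WARC record's own Content-Length would also have to be updated to match; that step is left out here.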
[20:19] tef: I'll run that by alard, he's more familiar with wget's WARC code than me
[20:19] if there ends up being an easy way to hook into what's coming over the wire that'd be worth trying too
[20:20] yes
[20:23] this'll be fun, though; I know that people in this channel have generated several terabytes of WARCs that may have invalid records :P
[20:23] so going forward we'll need some way to fix that -- maybe stripping out Transfer-Encoding would be easiest
[20:24] guess I'll poke around the warc-tools source and see how hard it'd be to write a tool like that
[20:24] and I guess this would be an excuse to break out the Hadoop or something
[20:40] fix-brokenass-shit.sh
[20:55] chronomex: nah
[20:55] fix-brokenass-shit.rb if I'm writing it, obviously
[20:55] or maybe it can be FixBrokenassShit.hs and then everyone will be angry
[20:59] lol
[21:10] Haskell? Excellent.
[21:11] you can make more people mad by writing it in Erlang.
[21:11] YOU DON'T NEED CONCURRENCY FOR THAT, WTF MAN
[21:15] yes
[21:15] one Erlang process per WARC
[21:16] actually, that would probably scale pretty well
[21:30] yipdw: I can hack warc2warc to do it
[21:32] it's literally putting a content = content.replace("Transfer-Encoding: chunked\r\n","") or similar iirc
[21:32] but i'd rather not a specific wget fix in the trunk, but it is easy to fork
[21:33] I guess I can add it as an option 'sniff: chunking'
[21:33] or -W for --wget-workaround
[21:33] or something
[21:35] tef: I was just going to write something for use by people here
[21:35] as I think it's got the largest concentration of people who have used wget's WARC writer
[21:35] or, really, just get something that SketchCow (or whoever) can run on a large batch of WARCs
[21:35] and concurrently fix up wget's WARC writer to do The Right Thing
[21:36] but, yeah, a fork works too
[21:43] http://blog.archive.org/2012/01/17/12-hours-dark-internet-archive-vs-censorship/
[21:44] I didn't know IA was blacklisted in China
[21:44] though I shouldn't be surprised
[21:58] ditto
[22:02] http://abcnews.go.com/Technology/wikipedia-blackout-websites-wikipedia-reddit-dark-wednesday-protest/story?id=15373251#.TxXwA2OXRN0
[22:02] what crappy lead graphics
[22:03] what sort of two-bit hack graphic artists work for the major news media these days?
[22:18] ha, and it's awesome seeing what Rupert Murdoch is tweeting
[22:18] it seems that Google, a software company, is out to destroy "software creators"
[22:18] yipdw: i'll just make a wget-warc-clean.py or something
[22:18] i tend to avoid names like unfuck
[22:18] :3
[22:18] that man is like Gene Ray, Cubic, but with money
[22:18] tef: heh, good thing
[22:25] http://torrentfreak.com/mpaa-internet-blackout-is-a-pr-stunt-users-are-corporate-pawns-120117/
[22:27] well, yeah, it is a PR stunt
[22:27] good to know they can state the obvious
[22:27] bleh. fix-broken-shit.py
[22:27] work repo
[22:28] wait
[22:28] "The following is a statement by Senator Chris Dodd, Chairman and CEO of the Motion Picture Association of America, Inc. (MPAA)"
[22:28] I didn't know corruption had gone tha far!
[22:28] +t
[22:29] man, talk about cutting out the middleman
[22:29] why buy off legislature when you can just install someone to do your work
[22:30] because legislature is cheap
[22:30] $5000ish
[22:30] (note, when I said that, I had just read the ruby/haskell/erlang part of my log)
[22:30] chronomex: so are some Indian software houses, but that doesn't mean you'll get what you want
[22:31] yipdw: like legislature!
[22:31] it freaks me out when I call the bank credit card number and get someone in india.
[22:33] I think that means you need a new bank
[22:34] yeah... I have to wonder if that's actually how that card got compromised in the first place
[22:36] lol
[22:36] i've been working on leaving them.
[22:36] but how do I know if a bank I am looking at has outsourced their calls?
[22:37] and to where
[22:38] (this is also one of the largest banks in the world we're talking about)
[22:40] Coderjoe: didn't know you were into haskell
[22:42] I'm curious where I gave the impression I was
[22:42] "ruby/haskell/erlang" above
[22:43] closure: yipdw is the one that mention haskell. I was reading my channel scrollback
[22:43] Coderjoe eats GADTs for breakfast
[22:43] s/(mention)/\1ed/
[22:45] >>= and <*> would be fun breakfast cereal indeed
[22:47] cherri (.) s
[23:18] Downloading the Sound of Young America - we'll have that going up today.
[23:18] Awww yeah