#archiveteam 2012-01-17,Tue

Time Nickname Message
00:13 🔗 Coderjoe trying to find some video you saw over a year ago on youtube is like ...
00:13 🔗 Coderjoe trying to find a book in the remains of a library explosion?
00:13 🔗 Coderjoe (a specific book, of course)
00:15 🔗 yipdw trying to find some video you saw over a year ago on YouTube is like THIS VIDEO HAS BEEN TAKEN DOWN DUE TO A COPYRIGHT CLAIM FROM UNITED MEDIA CONGLOMERATE
00:18 🔗 Coderjoe well, except the video I am trying to find was entirely original content
00:19 🔗 Ymgve doesn't stop someone from DMCAing it on a hunch
02:48 🔗 underscor Coderjoe: Hahaha, yeah
02:49 🔗 underscor Well, with something on the order of 36 hours of content uploaded a minute
02:49 🔗 chronomex about 25 hours of content dmca'd a minute
02:49 🔗 underscor hahaa
02:50 🔗 yipdw tef: the decompression feature you added to warc2warc seems to work fine
02:50 🔗 yipdw tef: I'm uploading a warc2warc'd warc to a Wayback instance to check the result
02:50 🔗 Paradoks If only we could watch all the new Youtube content Back to the Future II style.
02:51 🔗 yipdw damnit
02:51 🔗 yipdw I tripped over my DNS server
02:51 🔗 chronomex lol
02:52 🔗 yipdw or more specifically its serial cable
03:28 🔗 yipdw oh, hmm
03:28 🔗 yipdw tef: I think warc2warc may be losing data
03:28 🔗 dashcloud anyone used this filesystem before? http://www.lessfs.com/ it describes itself as: A high performance inline data deduplicating filesystem for Linux.
03:29 🔗 yipdw tef: is there a place that I can send you examples? or would you prefer me to upload example WARCs somewhere?
03:30 🔗 tef thomas.figg@hanzoarchives.com
03:30 🔗 yipdw ah, ok
03:30 🔗 tef it *shouldn't be losing data* obviously
03:30 🔗 yipdw right
03:30 🔗 tef I will look at it tomorrow morning, ish. Cos it's almost 4 am
03:30 🔗 yipdw I noticed, however, that one WARC record was truncated to zero bytes
03:31 🔗 yipdw sure, no problem
03:31 🔗 tef hmm, that would suggest something failed to decode
03:31 🔗 yipdw yeah, possibly. there's not much more that can be done until I send you examples though :)
03:31 🔗 yipdw so yeah, I'll write that up etc
03:37 🔗 tef obv if it is easier to upload them and link them, I am all for that too
03:38 🔗 tef I'm quite happy to work from 'this warc does not work as expected'
03:39 🔗 yipdw sure
03:39 🔗 yipdw actually, would you prefer that I file an issue in Bitbucket?
03:39 🔗 yipdw none of this is really private
03:41 🔗 tef i'm just happy with test data :3
03:41 🔗 tef it's more than I get a lot of the time...
04:01 🔗 tef yipdw: that's perfect, thanks. I will get back to you tomorrow
06:40 🔗 savetz mobileme-grab quit on me with "error downloading 'sboyack'" - do I need to do anything special to make sure that gets grabbed again?
06:40 🔗 savetz the reason is apparently because my hard drive filled up
07:21 🔗 yipdw savetz: you can either run dld-single.sh to explicitly redownload it, or just let it linger in the queue -- it'll be requeued at some point
12:03 🔗 Ymgve SketchCow: seen this? http://www.metafilter.com/111701/Putting-kickstarter-out-of-business
12:03 🔗 Ymgve Might be of interest since you've used Kickstarter
12:08 🔗 ersi So.. they should be put out of business because they make money?
12:08 🔗 ersi They offer an umbrella, a brand and a collected spot with people that like to fling their wallets open
12:09 🔗 Ymgve 5% sounds a bit much, though
12:10 🔗 ersi I'm not able to judge that, but I believe they take a lot of hassle out of asking people for money to do a project
12:11 🔗 Ymgve true
12:12 🔗 ersi It sounds like ramblings from an outsider that hasn't tried A) raising money B) tried out kickstarter
14:48 🔗 tef yipdw: found the bug
14:48 🔗 tef well technically it is in warc-wget :/ it is producing warc records, with http in them, that *claim* to be chunked, and aren't
14:51 🔗 tef I can put a work around in that broken http messages are left untouched. I'm not sure what is the right thing to do when you get a Transfer-Encoding: chunked with no chunks
14:56 🔗 tef fwiw - if you put the chunks back in, -D will clean them
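The "claims to be chunked, and isn't" case described above can be detected by trying to parse the body as HTTP/1.1 chunked encoding. The following is a minimal sketch, not the code in warc2warc: the function name dechunk is hypothetical, and the sketch ignores trailer headers after the final chunk.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def dechunk(body: bytes):
    """Return the decoded payload if `body` parses as chunked encoding, else None.

    Hypothetical helper for illustration; trailers after the last chunk are ignored.
    """
    out = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            return None  # no chunk-size line: not chunked
        # A chunk-size line is hex digits, optionally followed by ";extensions".
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        if not size_field or any(c not in HEX_DIGITS for c in size_field):
            return None
        size = int(size_field, 16)
        pos = eol + 2
        if size == 0:
            return bytes(out)  # zero-length chunk terminates the body
        chunk = body[pos:pos + size]
        # Each chunk must be complete and followed by CRLF.
        if len(chunk) != size or body[pos + size:pos + size + 2] != b"\r\n":
            return None
        out += chunk
        pos += size + 2
```

A None result on a record whose headers say Transfer-Encoding: chunked would flag it as one of the broken messages that should be left untouched rather than decoded.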
17:13 🔗 savetz the MobileMe grabber has been crashing on me in the last 24 hours. is apple being weird? last time= ERROR (3)
17:28 🔗 yipdw tef: got your message; that's some weird wget-warc behavior
17:29 🔗 yipdw I'll check what's actually coming back from the server
17:32 🔗 Coderjoe_ yipdw: I think I know why wget-warc is doing it. I think it snags the result after wget has handled transfer-encoding, rather than the raw network stream.
17:33 🔗 Coderjoe_ ISTR it hooks a temporary file or something into the output code
17:34 🔗 yipdw oh, so the response really is chunked but the warc code isn't able to write the chunks?
17:34 🔗 yipdw hm
17:34 🔗 yipdw one sec, I'm gonna see what curl does for some of these URIs
17:34 🔗 yipdw I haven't actually examined what comes back from these servers too closely
17:36 🔗 Coderjoe I'm fairly sure that is correct
17:37 🔗 yipdw well, yeah, the response from ff.net is definitely marked as chunked
17:38 🔗 yipdw hmm
17:39 🔗 yipdw I wonder if there's a way to get precisely what wget gets
17:40 🔗 Coderjoe yeah. move the hook into the network handling code, before the transfer-encoding handler (or perhaps IN the transfer-encoding handler?)
17:41 🔗 yipdw from the way the WARC code is structured, that doesn't look like it's a trivial job
17:41 🔗 yipdw or rather it looks like it was designed to sit at arm's length from the rest of wget
17:42 🔗 yipdw but I dunno, I've only skimmed over said code
17:42 🔗 yipdw alard would know best
17:46 🔗 yipdw damn
17:46 🔗 yipdw just been reading the gethttp function in wget's http.c
17:47 🔗 yipdw that could really use a state machine formalization
17:47 🔗 yipdw IMO, anyway, but I've been told that I'm weird
17:56 🔗 yipdw on a lighter note
17:56 🔗 yipdw http://tctechcrunch2011.files.wordpress.com/2011/11/screen-shot-2011-11-20-at-8-44-25-pm1.png?w=620
18:35 🔗 tef warcs are meant to be *raw* traffic somewhat
18:36 🔗 tef well really - delete any transfer-encoding headers and add a content-length
18:36 🔗 tef or put the raw traffic in
18:36 🔗 tef hmm
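The first option tef mentions (delete the Transfer-Encoding header and add a Content-Length) might look roughly like the sketch below. It assumes the body is already known not to be chunked, as in the wget-warc records discussed here; the function name strip_bogus_chunked is hypothetical and this is not code from warc-tools.

```python
import re

# Header lines to drop; matched case-insensitively at the start of a line.
_TE_CHUNKED = re.compile(rb"^Transfer-Encoding:[ \t]*chunked[ \t]*\r\n", re.I | re.M)
_CONTENT_LENGTH = re.compile(rb"^Content-Length:.*\r\n", re.I | re.M)


def strip_bogus_chunked(message: bytes) -> bytes:
    """Drop 'Transfer-Encoding: chunked' and set Content-Length to the body size.

    Assumes `message` is one complete HTTP response whose body is NOT chunked.
    """
    head, sep, body = message.partition(b"\r\n\r\n")
    if not sep:
        return message  # no header/body boundary found; leave untouched
    head += b"\r\n"  # restore the CRLF that ends the last header line
    if not _TE_CHUNKED.search(head):
        return message  # nothing claims to be chunked; leave untouched
    head = _TE_CHUNKED.sub(b"", head)
    head = _CONTENT_LENGTH.sub(b"", head)  # avoid a duplicate Content-Length
    head += b"Content-Length: %d\r\n" % len(body)
    return head + b"\r\n" + body
```

Because the rewritten message differs in size from the original, the enclosing WARC record's own Content-Length would also need to be recomputed when the record is written back out.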
18:37 🔗 savetz my mobile-me downloader is crashing constantly. is it me? ERROR (3)s all the time
18:45 🔗 SketchCow Ymgve: That guy is an idiot.
18:45 🔗 SketchCow Kickstarter does SO MUCH MORE for the 5%
18:57 🔗 balrog hey SketchCow
18:57 🔗 balrog off the top of your head, what would be useful metadata fields for a floppy archival format?
18:58 🔗 tef yipdw: I think it would be easier to strip out the headers that don't apply - like transfer-encoding
19:08 🔗 Coderjoe one post: Short version: "I don't like something and don't want to use it, so therefore it should be made to go away." Oh FFS...
19:43 🔗 SketchCow Oh, god.
19:44 🔗 SketchCow OK, in short form: There is SO MUCH FUCKING WORK being done on that, the website would REALLY be improved with links to all that.
19:44 🔗 SketchCow So when I get back from the interview I'm doing, let me find that.
19:44 🔗 SketchCow I want to do some actual archive team business today, so look forward to that.
19:51 🔗 ersi Booya!
20:19 🔗 yipdw tef: I'll run that by alard, he's more familiar with wget's WARC code than me
20:19 🔗 yipdw if there ends up being an easy way to hook into what's coming over the wire that'd be worth trying too
20:20 🔗 tef yes
20:23 🔗 yipdw this'll be fun, though; I know that people in this channel have generated several terabytes of WARCs that may have invalid records :P
20:23 🔗 yipdw so going forward we'll need some way to fix that -- maybe stripping out Transfer-Encoding would be easiest
20:24 🔗 yipdw guess I'll poke around the warc-tools source and see how hard it'd be to write a tool like that
20:24 🔗 yipdw and I guess this would be an excuse to break out the Hadoop or something
20:40 🔗 chronomex fix-brokenass-shit.sh
20:55 🔗 yipdw chronomex: nah
20:55 🔗 yipdw fix-brokenass-shit.rb if I'm writing it, obviously
20:55 🔗 yipdw or maybe it can be FixBrokenassShit.hs and then everyone will be angry
20:59 🔗 underscor lol
21:10 🔗 chronomex Haskell? Excellent.
21:11 🔗 chronomex you can make more people mad by writing it in Erlang.
21:11 🔗 chronomex YOU DON'T NEED CONCURRENCY FOR THAT, WTF MAN
21:15 🔗 yipdw yes
21:15 🔗 yipdw one Erlang process per WARC
21:16 🔗 yipdw actually, that would probably scale pretty well
21:30 🔗 tef yipdw: I can hack warc2warc to do it
21:32 🔗 tef it's literally putting a content = content.replace("Transfer-Encoding: chunked\r\n","") or similar iirc
21:32 🔗 tef but i'd rather not a specific wget fix in the trunk, but it is easy to fork
21:33 🔗 tef I guess I can add it as an option 'sniff: chunking'
21:33 🔗 tef or -W for --wget-workaround
21:33 🔗 tef or something
21:35 🔗 yipdw tef: I was just going to write something for use by people here
21:35 🔗 yipdw as I think it's got the largest concentration of people who have used wget's WARC writer
21:35 🔗 yipdw or, really, just get something that SketchCow (or whoever) can run on a large batch of WARCs
21:35 🔗 yipdw and concurrently fix up wget's WARC writer to do The Right Thing
21:36 🔗 yipdw but, yeah, a fork works too
21:43 🔗 closure http://blog.archive.org/2012/01/17/12-hours-dark-internet-archive-vs-censorship/
21:44 🔗 yipdw I didn't know IA was blacklisted in China
21:44 🔗 yipdw though I shouldn't be surprised
21:58 🔗 chronomex ditto
22:02 🔗 yipdw http://abcnews.go.com/Technology/wikipedia-blackout-websites-wikipedia-reddit-dark-wednesday-protest/story?id=15373251#.TxXwA2OXRN0
22:02 🔗 yipdw what crappy lead graphics
22:03 🔗 yipdw what sort of two-bit hack graphic artists work for the major news media these days?
22:18 🔗 yipdw ha, and it's awesome seeing what Rupert Murdoch is tweeting
22:18 🔗 yipdw it seems that Google, a software company, is out to destroy "software creators"
22:18 🔗 tef yipdw: i'll just make a wget-warc-clean.py or something
22:18 🔗 tef i tend to avoid names like unfuck
22:18 🔗 tef :3
22:18 🔗 yipdw that man is like Gene Ray, Cubic, but with money
22:18 🔗 yipdw tef: heh, good thing
22:25 🔗 NovaKing http://torrentfreak.com/mpaa-internet-blackout-is-a-pr-stunt-users-are-corporate-pawns-120117/
22:27 🔗 yipdw well, yeah, it is a PR stunt
22:27 🔗 yipdw good to know they can state the obvious
22:27 🔗 Coderjoe bleh. fix-broken-shit.py
22:27 🔗 tef work repo
22:28 🔗 yipdw wait
22:28 🔗 yipdw "The following is a statement by Senator Chris Dodd, Chairman and CEO of the Motion Picture Association of America, Inc. (MPAA)"
22:28 🔗 yipdw I didn't know corruption had gone tha far!
22:28 🔗 yipdw +t
22:29 🔗 yipdw man, talk about cutting out the middleman
22:29 🔗 yipdw why buy off legislature when you can just install someone to do your work
22:30 🔗 chronomex because legislature is cheap
22:30 🔗 chronomex $5000ish
22:30 🔗 Coderjoe (note, when I said that, I had just read the ruby/haskell/erlang part of my log)
22:30 🔗 yipdw chronomex: so are some Indian software houses, but that doesn't mean you'll get what you want
22:31 🔗 chronomex yipdw: like legislature!
22:31 🔗 Coderjoe it freaks me out when I call the bank credit card number and get someone in india.
22:33 🔗 yipdw I think that means you need a new bank
22:34 🔗 Coderjoe yeah... I have to wonder if that's actually how that card got compromised in the first place
22:36 🔗 chronomex lol
22:36 🔗 Coderjoe i've been working on leaving them.
22:36 🔗 Coderjoe but how do I know if a bank I am looking at has outsourced their calls?
22:37 🔗 Coderjoe and to where
22:38 🔗 Coderjoe (this is also one of the largest banks in the world we're talking about)
22:40 🔗 closure Coderjoe: didn't know you were into haskell
22:42 🔗 Coderjoe I'm curious where I gave the impression I was
22:42 🔗 closure "ruby/haskell/erlang" above
22:43 🔗 Coderjoe closure: yipdw is the one that mention haskell. I was reading my channel scrollback
22:43 🔗 yipdw Coderjoe eats GADTs for breakfast
22:43 🔗 Coderjoe s/(mention)/\1ed/
22:45 🔗 closure >>= and <*> would be fun breakfast cereal indeed
22:47 🔗 yipdw cherri (.) s
23:18 🔗 SketchCow Downloading the Sound of Young America - we'll have that going up today.
23:18 🔗 SketchCow Awww yeah
