Time |
Nickname |
Message |
00:13
🔗
|
Coderjoe |
trying to find some video you saw over a year ago on youtube is like ... |
00:13
🔗
|
Coderjoe |
trying to find a book in the remains of a library explosion? |
00:13
🔗
|
Coderjoe |
(a specific book, of course) |
00:15
🔗
|
yipdw |
trying to find some video you saw over a year ago on YouTube is like THIS VIDEO HAS BEEN TAKEN DOWN DUE TO A COPYRIGHT CLAIM FROM UNITED MEDIA CONGLOMERATE |
00:18
🔗
|
Coderjoe |
well, except the video I am trying to find was entirely original content |
00:19
🔗
|
Ymgve |
doesn't stop someone from DMCAing it on a hunch |
02:48
🔗
|
underscor |
Coderjoe: Hahaha, yeah |
02:49
🔗
|
underscor |
Well, with something on the order of 36 hours of content uploaded a minute |
02:49
🔗
|
chronomex |
about 25 hours of content dmca'd a minute |
02:49
🔗
|
underscor |
hahaa |
02:50
🔗
|
yipdw |
tef: the decompression feature you added to warc2warc seems to work fine |
02:50
🔗
|
yipdw |
tef: I'm uploading a warc2warc'd warc to a Wayback instance to check the result |
02:50
🔗
|
Paradoks |
If only we could watch all the new Youtube content Back to the Future II style. |
02:51
🔗
|
yipdw |
damnit |
02:51
🔗
|
yipdw |
I tripped over my DNS server |
02:51
🔗
|
chronomex |
lol |
02:52
🔗
|
yipdw |
or more specifically its serial cable |
03:28
🔗
|
yipdw |
oh, hmm |
03:28
🔗
|
yipdw |
tef: I think warc2warc may be losing data |
03:28
🔗
|
dashcloud |
anyone used this filesystem before? http://www.lessfs.com/ it describes itself as: A high performance inline data deduplicating filesystem for Linux. |
03:29
🔗
|
yipdw |
tef: is there a place that I can send you examples? or would you prefer me to upload example WARCs somewhere? |
03:30
🔗
|
tef |
thomas.figg@hanzoarchives.com |
03:30
🔗
|
yipdw |
ah, ok |
03:30
🔗
|
tef |
it *shouldn't be losing data* obviously |
03:30
🔗
|
yipdw |
right |
03:30
🔗
|
tef |
I will look at it tomorrow morning, ish. Cos it's almost 4 am |
03:30
🔗
|
yipdw |
I noticed, however, that one WARC record was truncated to zero bytes |
03:31
🔗
|
yipdw |
sure, no problem |
03:31
🔗
|
tef |
hmm, that would suggest something failed to decode |
03:31
🔗
|
yipdw |
yeah, possibly. there's not much more that can be done until I send you examples though :) |
03:31
🔗
|
yipdw |
so yeah, I'll write that up etc |
03:37
🔗
|
tef |
obv if it is easier to upload them and link them, I am all for that too |
03:38
🔗
|
tef |
I'm quite happy to work from 'this warc does not work as expected' |
03:39
🔗
|
yipdw |
sure |
03:39
🔗
|
yipdw |
actually, would you prefer that I file an issue in Bitbucket? |
03:39
🔗
|
yipdw |
none of this is really private |
03:41
🔗
|
tef |
i'm just happy with test data :3 |
03:41
🔗
|
tef |
it's more than I get a lot of the time... |
04:01
🔗
|
tef |
yipdw: that's perfect, thanks. I will get back to you tomorrow |
06:40
🔗
|
savetz |
mobileme-grab quit on me with "error downloading 'sboyack'" - do I need to do anything special to make sure that gets grabbed again? |
06:40
🔗
|
savetz |
the reason is apparently because my hard drive filled up |
07:21
🔗
|
yipdw |
savetz: you can either run dld-single.sh to explicitly redownload it, or just let it linger in the queue -- it'll be requeued at some point |
12:03
🔗
|
Ymgve |
SketchCow: seen this? http://www.metafilter.com/111701/Putting-kickstarter-out-of-business |
12:03
🔗
|
Ymgve |
Might be of interest since you've used Kickstarter |
12:08
🔗
|
ersi |
So.. they should be put out of business because they make money? |
12:08
🔗
|
ersi |
They offer an umbrella, a brand and a collected spot with people that like to fling their wallets open |
12:09
🔗
|
Ymgve |
5% sounds a bit much, though |
12:10
🔗
|
ersi |
I'm not able to judge that, but I believe they take a lot of hassle out of asking people for money to do a project |
12:11
🔗
|
Ymgve |
true |
12:12
🔗
|
ersi |
It sounds like ramblings from an outsider that hasn't tried A) raising money B) tried out kickstarter |
14:48
🔗
|
tef |
yipdw: found the bug |
14:48
🔗
|
tef |
well technically it is in wget-warc :/ it is producing warc records, with http in them, that *claim* to be chunked, and aren't |
14:51
🔗
|
tef |
I can put a workaround in so that broken http messages are left untouched. I'm not sure what the right thing to do is when you get a Transfer-Encoding: chunked with no chunks |
14:56
🔗
|
tef |
fwiw - if you put the chunks back in, -D will clean them |
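For readers unfamiliar with the encoding being discussed, here is a minimal sketch of what "putting the chunks back in" and cleaning them out again means at the byte level. The function names `encode_chunked` and `decode_chunked` are hypothetical and are not taken from warc-tools; how `-D` is implemented is not shown in this log, so this only illustrates generic HTTP/1.1 chunked framing and ignores trailers.

```python
def encode_chunked(body: bytes, chunk_size: int = 4096) -> bytes:
    """Wrap a plain body in HTTP/1.1 chunked framing (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out = []
    for start in range(0, len(body), chunk_size):
        piece = body[start:start + chunk_size]
        # Each chunk is: hex length, CRLF, data, CRLF.
        out.append(b"%x\r\n" % len(piece) + piece + b"\r\n")
    # A zero-length chunk followed by a blank line ends the stream.
    out.append(b"0\r\n\r\n")
    return b"".join(out)


def decode_chunked(data: bytes) -> bytes:
    """Strip chunked framing and return the plain body (trailers ignored)."""
    out = []
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)  # ValueError if the size line is missing
        # Drop any ";extension" after the hex size.
        size = int(data[pos:eol].split(b";", 1)[0], 16)
        pos = eol + 2
        if size == 0:
            return b"".join(out)
        out.append(data[pos:pos + size])
        pos += size + 2  # skip the data and its trailing CRLF
```

A body written without this framing, as in the records described above, makes `decode_chunked` raise `ValueError` or return garbage, which is roughly why a decoder that trusts the header can end up with an empty record.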
17:13
🔗
|
savetz |
the MobileMe grabber has been crashing on me in the last 24 hours. is apple being weird? last time= ERROR (3) |
17:28
🔗
|
yipdw |
tef: got your message; that's some weird wget-warc behavior |
17:29
🔗
|
yipdw |
I'll check what's actually coming back from the server |
17:32
🔗
|
Coderjoe_ |
yipdw: I think I know why wget-warc is doing it. I think it snags the result after wget has handled transfer-encoding, rather than the raw network stream. |
17:33
🔗
|
Coderjoe_ |
ISTR it hooks a temporary file or something into the output code |
17:34
🔗
|
yipdw |
oh, so the response really is chunked but the warc code isn't able to write the chunks? |
17:34
🔗
|
yipdw |
hm |
17:34
🔗
|
yipdw |
one sec, I'm gonna see what curl does for some of these URIs |
17:34
🔗
|
yipdw |
I haven't actually examined what comes back from these servers too closely |
17:36
🔗
|
Coderjoe |
I'm fairly sure that is correct |
17:37
🔗
|
yipdw |
well, yeah, the response from ff.net is definitely marked as chunked |
17:38
🔗
|
yipdw |
hmm |
17:39
🔗
|
yipdw |
I wonder if there's a way to get precisely what wget gets |
17:40
🔗
|
Coderjoe |
yeah. move the hook into the network handling code, before the transfer-encoding handler (or perhaps IN the transfer-encoding handler?) |
17:41
🔗
|
yipdw |
from the way the WARC code is structured, that doesn't look like it's a trivial job |
17:41
🔗
|
yipdw |
or rather it looks like it was designed to sit at arm's length from the rest of wget |
17:42
🔗
|
yipdw |
but I dunno, I've only skimmed over said code |
17:42
🔗
|
yipdw |
alard would know best |
17:46
🔗
|
yipdw |
damn |
17:46
🔗
|
yipdw |
just been reading the gethttp function in wget's http.c |
17:47
🔗
|
yipdw |
that could really use a state machine formalization |
17:47
🔗
|
yipdw |
IMO, anyway, but I've been told that I'm weird |
17:56
🔗
|
yipdw |
on a lighter note |
17:56
🔗
|
yipdw |
http://tctechcrunch2011.files.wordpress.com/2011/11/screen-shot-2011-11-20-at-8-44-25-pm1.png?w=620 |
18:35
🔗
|
tef |
warcs are meant to be *raw* traffic somewhat |
18:36
🔗
|
tef |
well really - delete any transfer-encoding headers and add a content-length |
18:36
🔗
|
tef |
or put the raw traffic in |
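The first option tef describes (drop Transfer-Encoding, add a Content-Length) could look roughly like the sketch below. It assumes the record block is a complete HTTP response whose body has already been de-chunked, which is the wget-warc case discussed here; `strip_transfer_encoding` is a hypothetical name, not a warc-tools function, and folded header lines are not handled.

```python
def strip_transfer_encoding(message: bytes) -> bytes:
    """Replace Transfer-Encoding with a Content-Length matching the body.

    Assumes the body is NOT actually chunked; messages without a
    Transfer-Encoding header are returned unchanged.
    """
    head, sep, body = message.partition(b"\r\n\r\n")
    if not sep:
        return message  # no header/body separator: leave untouched
    lines = head.split(b"\r\n")
    start_line, headers = lines[0], lines[1:]

    def name(header: bytes) -> bytes:
        # Header names are case-insensitive.
        return header.split(b":", 1)[0].strip().lower()

    if not any(name(h) == b"transfer-encoding" for h in headers):
        return message
    # Drop the misleading header and any stale Content-Length.
    kept = [h for h in headers
            if name(h) not in (b"transfer-encoding", b"content-length")]
    kept.append(b"Content-Length: %d" % len(body))
    return b"\r\n".join([start_line] + kept) + b"\r\n\r\n" + body
```

Because the block gets shorter, the enclosing WARC record's own Content-Length (and any payload digest covering the headers) would need to be recomputed as well.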
18:36
🔗
|
tef |
hmm |
18:37
🔗
|
savetz |
my mobile-me downloader is crashing constantly. is it me? ERROR (3)s all the time |
18:45
🔗
|
SketchCow |
Ymgve: That guy is an idiot. |
18:45
🔗
|
SketchCow |
Kickstarter does SO MUCH MORE for the 5% |
18:57
🔗
|
balrog |
hey SketchCow |
18:57
🔗
|
balrog |
off the top of your head, what would be useful metadata fields for a floppy archival format? |
18:58
🔗
|
tef |
yipdw: I think it would be easier to strip out the headers that don't apply - like transfer-encoding |
19:08
🔗
|
Coderjoe |
one post: Short version: "I don't like something and don't want to use it, so therefore it should be made to go away." Oh FFS... |
19:43
🔗
|
SketchCow |
Oh, god. |
19:44
🔗
|
SketchCow |
OK, in short form: There is SO MUCH FUCKING WORK being done on that, the website would REALLY be improved with links to all that. |
19:44
🔗
|
SketchCow |
So when I get back from the interview I'm doing, let me find that. |
19:44
🔗
|
SketchCow |
I want to do some actual archive team business today, so look forward to that. |
19:51
🔗
|
ersi |
Booya! |
20:19
🔗
|
yipdw |
tef: I'll run that by alard, he's more familiar with wget's WARC code than me |
20:19
🔗
|
yipdw |
if there ends up being an easy way to hook into what's coming over the wire that'd be worth trying too |
20:20
🔗
|
tef |
yes |
20:23
🔗
|
yipdw |
this'll be fun, though; I know that people in this channel have generated several terabytes of WARCs that may have invalid records :P |
20:23
🔗
|
yipdw |
so going forward we'll need some way to fix that -- maybe stripping out Transfer-Encoding would be easiest |
20:24
🔗
|
yipdw |
guess I'll poke around the warc-tools source and see how hard it'd be to write a tool like that |
20:24
🔗
|
yipdw |
and I guess this would be an excuse to break out the Hadoop or something |
20:40
🔗
|
chronomex |
fix-brokenass-shit.sh |
20:55
🔗
|
yipdw |
chronomex: nah |
20:55
🔗
|
yipdw |
fix-brokenass-shit.rb if I'm writing it, obviously |
20:55
🔗
|
yipdw |
or maybe it can be FixBrokenassShit.hs and then everyone will be angry |
20:59
🔗
|
underscor |
lol |
21:10
🔗
|
chronomex |
Haskell? Excellent. |
21:11
🔗
|
chronomex |
you can make more people mad by writing it in Erlang. |
21:11
🔗
|
chronomex |
YOU DON'T NEED CONCURRENCY FOR THAT, WTF MAN |
21:15
🔗
|
yipdw |
yes |
21:15
🔗
|
yipdw |
one Erlang process per WARC |
21:16
🔗
|
yipdw |
actually, that would probably scale pretty well |
21:30
🔗
|
tef |
yipdw: I can hack warc2warc to do it |
21:32
🔗
|
tef |
it's literally putting a content = content.replace("Transfer-Encoding: chunked\r\n","") or similar iirc |
21:32
🔗
|
tef |
but i'd rather not put a wget-specific fix in the trunk, though it is easy to fork |
21:33
🔗
|
tef |
I guess I can add it as an option 'sniff: chunking' |
21:33
🔗
|
tef |
or -W for --wget-workaround |
21:33
🔗
|
tef |
or something |
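One way to read the "sniff: chunking" idea is: before trusting a Transfer-Encoding: chunked header, check whether the body actually parses as a chunked stream. The sketch below is an assumption about what such a check might do, not tef's implementation; `looks_chunked` is a hypothetical name, and trailer lines are accepted without being validated.

```python
import re

_HEX = re.compile(rb"[0-9A-Fa-f]+")


def looks_chunked(body: bytes) -> bool:
    """Return True if body parses as a complete HTTP/1.1 chunked stream."""
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            return False  # no chunk-size line where one is required
        # Chunk-size line: hex digits, optionally followed by ";extension".
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        if not _HEX.fullmatch(size_field):
            return False
        size = int(size_field, 16)
        pos = eol + 2
        if size == 0:
            # Last chunk: either an immediate blank line, or trailers
            # terminated by a blank line.
            rest = body[pos:]
            return rest == b"\r\n" or rest.endswith(b"\r\n\r\n")
        # The chunk data must be followed by CRLF.
        if body[pos + size:pos + size + 2] != b"\r\n":
            return False
        pos += size + 2
```

A cleanup tool could then decode only when this returns True and otherwise treat the header as bogus, which keeps the workaround from touching records that really are chunked.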
21:35
🔗
|
yipdw |
tef: I was just going to write something for use by people here |
21:35
🔗
|
yipdw |
as I think it's got the largest concentration of people who have used wget's WARC writer |
21:35
🔗
|
yipdw |
or, really, just get something that SketchCow (or whoever) can run on a large batch of WARCs |
21:35
🔗
|
yipdw |
and concurrently fix up wget's WARC writer to do The Right Thing |
21:36
🔗
|
yipdw |
but, yeah, a fork works too |
21:43
🔗
|
closure |
http://blog.archive.org/2012/01/17/12-hours-dark-internet-archive-vs-censorship/ |
21:44
🔗
|
yipdw |
I didn't know IA was blacklisted in China |
21:44
🔗
|
yipdw |
though I shouldn't be surprised |
21:58
🔗
|
chronomex |
ditto |
22:02
🔗
|
yipdw |
http://abcnews.go.com/Technology/wikipedia-blackout-websites-wikipedia-reddit-dark-wednesday-protest/story?id=15373251#.TxXwA2OXRN0 |
22:02
🔗
|
yipdw |
what crappy lead graphics |
22:03
🔗
|
yipdw |
what sort of two-bit hack graphic artists work for the major news media these days? |
22:18
🔗
|
yipdw |
ha, and it's awesome seeing what Rupert Murdoch is tweeting |
22:18
🔗
|
yipdw |
it seems that Google, a software company, is out to destroy "software creators" |
22:18
🔗
|
tef |
yipdw: i'll just make a wget-warc-clean.py or something |
22:18
🔗
|
tef |
i tend to avoid names like unfuck |
22:18
🔗
|
tef |
:3 |
22:18
🔗
|
yipdw |
that man is like Gene Ray, Cubic, but with money |
22:18
🔗
|
yipdw |
tef: heh, good thing |
22:25
🔗
|
NovaKing |
http://torrentfreak.com/mpaa-internet-blackout-is-a-pr-stunt-users-are-corporate-pawns-120117/ |
22:27
🔗
|
yipdw |
well, yeah, it is a PR stunt |
22:27
🔗
|
yipdw |
good to know they can state the obvious |
22:27
🔗
|
Coderjoe |
bleh. fix-broken-shit.py |
22:27
🔗
|
tef |
work repo |
22:28
🔗
|
yipdw |
wait |
22:28
🔗
|
yipdw |
"The following is a statement by Senator Chris Dodd, Chairman and CEO of the Motion Picture Association of America, Inc. (MPAA)" |
22:28
🔗
|
yipdw |
I didn't know corruption had gone tha far! |
22:28
🔗
|
yipdw |
+t |
22:29
🔗
|
yipdw |
man, talk about cutting out the middleman |
22:29
🔗
|
yipdw |
why buy off legislature when you can just install someone to do your work |
22:30
🔗
|
chronomex |
because legislature is cheap |
22:30
🔗
|
chronomex |
$5000ish |
22:30
🔗
|
Coderjoe |
(note, when I said that, I had just read the ruby/haskell/erlang part of my log) |
22:30
🔗
|
yipdw |
chronomex: so are some Indian software houses, but that doesn't mean you'll get what you want |
22:31
🔗
|
chronomex |
yipdw: like legislature! |
22:31
🔗
|
Coderjoe |
it freaks me out when I call the bank credit card number and get someone in india. |
22:33
🔗
|
yipdw |
I think that means you need a new bank |
22:34
🔗
|
Coderjoe |
yeah... I have to wonder if that's actually how that card got compromised in the first place |
22:36
🔗
|
chronomex |
lol |
22:36
🔗
|
Coderjoe |
i've been working on leaving them. |
22:36
🔗
|
Coderjoe |
but how do I know if a bank I am looking at has outsourced their calls? |
22:37
🔗
|
Coderjoe |
and to where |
22:38
🔗
|
Coderjoe |
(this is also one of the largest banks in the world we're talking about) |
22:40
🔗
|
closure |
Coderjoe: didn't know you were into haskell |
22:42
🔗
|
Coderjoe |
I'm curious where I gave the impression I was |
22:42
🔗
|
closure |
"ruby/haskell/erlang" above |
22:43
🔗
|
Coderjoe |
closure: yipdw is the one that mention haskell. I was reading my channel scrollback |
22:43
🔗
|
yipdw |
Coderjoe eats GADTs for breakfast |
22:43
🔗
|
Coderjoe |
s/(mention)/\1ed/ |
22:45
🔗
|
closure |
>>= and <*> would be fun breakfast cereal indeed |
22:47
🔗
|
yipdw |
cherri (.) s |
23:18
🔗
|
SketchCow |
Downloading the Sound of Young America - we'll have that going up today. |
23:18
🔗
|
SketchCow |
Awww yeah |