[00:46] SAN FRANCISCO, Sept. 20, 2012 -- Salon Media Group (SLNM.PK) and The Well Group,
[00:46] Inc. today jointly announced that The WELL is now under the ownership of The Well Group,
[00:46] Inc., a private investment group composed of long-time WELL members.
[00:46]
[00:46] The Well Group, Inc. consists entirely of long-time WELL users with an average tenure
[00:46] exceeding 20 years. The purchase marks the first major online business taken private by users of
[00:46] the business itself.
[00:46] ....... that was unexpected
[01:04] * ivan` just discovers http://www.archiveteam.org/index.php?title=Wget_with_Lua_hooks - very cool
[01:10] http://largedownloads.ea.com/pub/ might be of interest to someone here; I have no disk space
[01:14] DFJustin: http://www.youtube.com/user/Tork110/videos has a lot of great game footage as well
[02:53] well, that's awesome. A panic grab of the Well was always going to suck mightily
[02:55] btw, I grabbed xkcd today. I doubt the rumors after yesterday's comic that 1111 = the last one, but better safe than sorry
[02:59] hmm, speaking of Salon ... http://www.salon.com/2012/09/20/history_as_recorded_on_twitter_is_vanishing/
[03:33] chronomex: You could just parse through the WARC later, finding all the src="" and do another pass on those URLs
[03:49] i thought twitter donated their entire backlog of tweets up to 2010 or 2011 to the library of congress
[03:49] i guess everything past then may decay though
[03:51] article is talking about the external links
[03:57] the article isn't sure what it's talking about, but it's positive twitter will always be around & have the data if you know how to dig it out
[04:09] ersi: that's a stupid patch, and anyway it won't necessarily get the full set of resources needed to render the page
[04:16] chronomex: Patch? What?
[04:16] you know, an ad-hoc fix
[04:16] ah, heh
[04:17] besides
[04:17] I've wanted to put a warc-making proxy behind my main webbrowser for a while
[04:17] disk is cheap
[04:17] websites disappear
[04:18] full-text search of every page you've ever viewed would be awesome
[04:18] In that case check out YaCy?
[04:19] not really what I'm interested in
[04:19] why don't you let me make what I want to make in peace
[04:21] Haha but I love suggesting almost relevant software to you
[04:22] yes
[04:23] it's an 85% buzzword-match
[04:23] suggest it!
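A minimal sketch of the second pass suggested at 03:33: scan a WARC for src="" attributes in the captured responses and emit those URLs for a follow-up grab. The filename and the naive line-by-line regex are illustrative assumptions, not part of any existing tool; a real pass would use an HTML parser and resolve relative URLs against each record's WARC-Target-URI.

```python
import gzip
import re

# Hypothetical input file from a previous grab.
WARC_PATH = "grab.warc.gz"

# Naive extractor: misses src attributes split across lines, catches the rest.
SRC_RE = re.compile(rb'src="([^"]+)"')

def second_pass_urls(path):
    """Yield every distinct src attribute value found anywhere in the WARC."""
    opener = gzip.open if path.endswith(".gz") else open
    seen = set()
    with opener(path, "rb") as fh:
        for line in fh:
            for match in SRC_RE.findall(line):
                url = match.decode("utf-8", "replace")
                if url not in seen:
                    seen.add(url)
                    yield url

if __name__ == "__main__":
    for url in second_pass_urls(WARC_PATH):
        print(url)  # feed these to wget/wget-lua for another pass
```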
[07:09] chronomex: btw you can write resource records instead of request/response records for warcs
[07:09] what's the advantage of that?
[07:09] chronomex: turns out I overlooked part of the spec, but resource records (files sans http traffic) are really meant for legacy record conversion and other protocols
[07:10] part of the pain in writing a proxy is that you have to grab the http traffic
[07:10] as most of the software revolves around hiding them/not logging them
[07:10] ah
[07:10] yeah
[07:10] couldn't find a nice way to hack requests for instance
[07:11] https://github.com/tef/warctools/tree/master/hanzo/httptools
[07:11] sooo I wrote my own http parser in python
[07:11] which can be used for both flat files and sockets
[07:13] chronomex: but yeah you may have more luck managing it that way
[07:14] I did look at the proxy rewriting protocol myself - can't remember why I avoided it, seemed like too much work
[07:15] chronomex: your other option is to try hacking it into webkit
[07:15] noooooo
[07:15] cf. http://browsertoolkit.com/fault-tolerance.png
[07:16] hahaha
[07:16] YUP
[07:16] I don't quite see the problem with a simple proxy that stuffs every request/response to disk
[07:16] but maybe I'm just stupid and like to think simplistically
[07:17] there isn't one
[07:17] the only problem is that almost all proxies are not designed to log the entire traffic
[07:17] right
[07:17] maybe you could write an ettercap plugin :v
[07:17] why not just write a proxy?
[07:17] you can
[07:17] it's relatively straightforward, except for http parsing
[07:18] oh and doing a man in the middle attack on ssl :-)
[07:18] heh
[07:18] ssl is out of scope for this project
[07:18] oh
[07:18] what about pipelining :-)
[07:19] heh
[07:19] * chronomex shrugs
[07:19] but yeah, if you want to go down that route, I can help somewhat
[07:19] https://github.com/tef/warctools/blob/master/hanzo/httptools/messaging.py
[07:20] You might have fun rewriting GET http://blah/foo into GET /foo \r\n Host: blah
[07:21] you're blowing my mind.
[07:21] chronomex: it supports reading from text and from fds :3
[07:21] :O
[07:24] https://github.com/tef/warctools/blob/master/hanzo/httptools/messaging.py#L204
[07:24] this is useful for writing warcs without gzip or chunked encoding :-)
[07:24] I'm not sure if that's appropriate
[07:24] isn't warc supposed to be about preserving the content on the wire?
[07:25] it's ok for a proxy to remove them
[07:25] or is it more about preserving the meaningful content
[07:25] proxies are allowed to vary transfer-encoding, and content-encoding too iirc
[07:25] sure
[07:25] you're talking about a proxy
[07:25] I'm talking about archiving
[07:25] if you like you can write warcs as is, sure
[07:25] why does nobody on the internet understand me today
[07:26] we had legacy issues preventing doing it like so
[07:26] it's also nice for canonicalizing
[07:26] Technically you *should* mutate the headers too, as proxies have to munge incoming requests somewhat
[07:26] right
[07:26] as it turns out not all webservers like full URLs in the request line
[07:27] but your web browser will be sending GET http://... to the proxy
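A minimal sketch of the rewrite mentioned at 07:20: turning the absolute-form request line a browser sends to a proxy into origin-form plus a Host: header. The function name and the assumption that headers arrive as a list of raw header lines are illustrative, not taken from warctools.

```python
from urllib.parse import urlsplit

def rewrite_request_line(request_line, headers):
    """Turn 'GET http://blah/foo HTTP/1.1' into 'GET /foo HTTP/1.1',
    adding a Host: header only if the browser didn't already send one."""
    method, target, version = request_line.split(" ", 2)
    if target.startswith("http://") or target.startswith("https://"):
        parts = urlsplit(target)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        target = path
        # Most browsers send Host: anyway, so usually only the first line changes.
        if not any(h.lower().startswith("host:") for h in headers):
            headers = headers + ["Host: " + parts.netloc]
    return " ".join([method, target, version]), headers

# Example:
line, hdrs = rewrite_request_line("GET http://blah/foo HTTP/1.1",
                                  ["User-Agent: example"])
# line == "GET /foo HTTP/1.1"; hdrs now contains "Host: blah"
```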
[07:27] that's not surprising
[07:27] well, yeah
[07:27] and have to add a Host: header maybe
[07:27] luckily most seem to add a Host: header too, so you only need to strip the first line
[07:28] the most annoying thing for me was the logic of when to pipeline and when to wait for close
[07:29] mmm
[07:29] oh and weird little bugs like people who miss out the reason phrase in response headers, i.e. '200 OK'
[07:30] but yeah, it's a complete and tested http parsing library :3
[07:30] thx
[07:30] may/may not help you somewhat
[07:32] req = httptools.RequestMessage(), req.feed(....), req.close(),
[07:32] resp = httptools.ResponseMessage(req), resp.feed(...), resp.close(),
[07:34] and things like .complete(), headers_complete(),
[07:34] it's not brilliant :v it is still bad for using bytearray for a buffer aaand using append()
[07:35] btw you need the req to build a response, because for HEAD requests there will be no body to the response
[07:35] *shakes fist at http*
[07:36] arrr
[07:38] i'm annoyed I can't officially commit this to warctools due to politics
[07:38] :(
[07:38] I can on the other hand make it reaaaly easy
[07:38] anyway, enough tv, bedtime now
[07:38] night!
[09:31] alard: I see you've made changes to warctools - warctozip.py - would you have anything against them being committed to the hanzo repository?
[09:32] alard: assuming you're happy with MIT licensing
[09:41] tef_: Which changes?
[09:42] https://github.com/alard/warctozip
[09:44] alard: warctozip.py basically
[09:44] unless you've got other fixes or improvements
[09:44] Ah, okay. No, I don't remember any other changes.
[09:45] (my hands are tied on implementing some changes, but i'm free to import anyone else's)
[09:46] so I might add warctozip.py to it, and maybe make it behave like the others with the same options
[09:46] Well, that's not the best version of the script. There are a few useful improvements in https://github.com/alard/warctozip-service/
[09:46] Things like timestamps, better url-to-filename conversion.
[09:47] I could add those things to warctozip.py and send you the improved version later.
[09:47] awesome
[09:50] could probably pull out a function and stick it in warc.py: write_zip(record, zip_file)
[09:56] I'll have at combining the scripts later.
[10:20] i got a star wars last dinner picture from underground gamer
[21:08] I have been given 25% more pipeline for submitting magazines
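A rough sketch of the write_zip(record, zip_file) helper floated at 09:50, using only the standard zipfile module. The record interface (a .url string and a .content pair of (content_type, payload bytes)) and the crude url-to-filename conversion are assumptions for illustration, not the actual warctools or warctozip code, which handles timestamps and filename collisions more carefully.

```python
import zipfile
from urllib.parse import urlsplit

def url_to_filename(url):
    """Very rough url-to-filename mapping; real warctozip does better."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/") or "index.html"
    return parts.netloc + "/" + path

def write_zip(record, zip_file):
    """Write one response record's payload into an open ZipFile.
    Assumes record.url is a str and record.content is a
    (content_type, payload_bytes) pair -- both hypothetical here."""
    content_type, payload = record.content
    zip_file.writestr(url_to_filename(record.url), payload)

# Usage sketch: pack all response records from some WARC reader.
# with zipfile.ZipFile("out.zip", "w", zipfile.ZIP_DEFLATED) as zf:
#     for record in records:
#         if record.type == "response":
#             write_zip(record, zf)
```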