Time |
Nickname |
Message |
00:46
🔗
|
SketchCow |
SAN FRANCISCO, Sept. 20, 2012 -- Salon Media Group (SLNM.PK) and The Well Group, |
00:46
🔗
|
SketchCow |
Inc. today jointly announced that The WELL is now under the ownership of The Well Group, |
00:46
🔗
|
SketchCow |
Inc., a private investment group composed of long-time WELL members. |
00:46
🔗
|
SketchCow |
|
00:46
🔗
|
SketchCow |
The Well Group, Inc. consists entirely of long-time WELL users with an average tenure |
00:46
🔗
|
SketchCow |
exceeding 20 years. The purchase marks the first major online business taken private by users of |
00:46
🔗
|
SketchCow |
the business itself. |
00:46
🔗
|
SketchCow |
....... that was unexpected |
01:04
🔗
|
* |
ivan` just discovers http://www.archiveteam.org/index.php?title=Wget_with_Lua_hooks - very cool |
01:10
🔗
|
ivan` |
http://largedownloads.ea.com/pub/ might be of interest to someone here; I have no disk space |
01:14
🔗
|
ivan` |
DFJustin: http://www.youtube.com/user/Tork110/videos has a lot of great game footage as well |
02:53
🔗
|
closure |
well, that's awesome. A panic grab of the Well was always going to suck mightily |
02:55
🔗
|
closure |
btw, I grabbed xkcd today. I doubt the comic 1111 = last rumors after yesterday's comic, but better safe than sorry |
02:59
🔗
|
closure |
hmm, speaking of Salon ... http://www.salon.com/2012/09/20/history_as_recorded_on_twitter_is_vanishing/ |
03:33
🔗
|
ersi |
chronomex: You could just parse through the WARC later, finding all the src="" and do another pass on those URLs |
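A rough sketch of the second pass ersi describes, assuming the HTML payloads have already been pulled out of the WARC by some other means. The names `SrcCollector` and `find_src_urls` are made up for illustration. It only sees `src=""` attributes, so stylesheet `url()` references and anything loaded by scripts would still be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SrcCollector(HTMLParser):
    """Collect every src="" attribute value, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "src" and value:
                self.urls.append(urljoin(self.base_url, value))


def find_src_urls(html_text, base_url):
    """Return the absolute URLs of all src attributes found in html_text."""
    collector = SrcCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.urls
```

The returned list could then be fed to a second fetch run, for example as a URL list file.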
03:49
🔗
|
Lord_Nigh |
i thought twitter donated their entire backlog of tweets up to 2010 or 2011 to the library of congress |
03:49
🔗
|
Lord_Nigh |
i guess everything past then may decay though |
03:51
🔗
|
DFJustin |
article is talking about the external links |
03:57
🔗
|
closure |
the article isn't sure what it's talking about, but it's positive twitter will always be around & have the data if you know how to dig it out |
04:09
🔗
|
chronomex |
ersi: that's a stupid patch, and anyway it won't necessarily get the full set of resources needed to render the page |
04:16
🔗
|
ersi |
chronomex: Patch? What? |
04:16
🔗
|
chronomex |
you know, an ad-hoc fix |
04:16
🔗
|
ersi |
ah, heh |
04:17
🔗
|
chronomex |
besides |
04:17
🔗
|
chronomex |
I've wanted to put a warc-making proxy behind my main webbrowser for a while |
04:17
🔗
|
chronomex |
disk is cheap |
04:17
🔗
|
chronomex |
websites disappear |
04:18
🔗
|
chronomex |
full-text search of every page you've ever viewed would be awesome |
04:18
🔗
|
soultcer |
In that case check out YaCy? |
04:19
🔗
|
chronomex |
not really what I'm interested in |
04:19
🔗
|
chronomex |
why don't you let me make what I want to make in peace |
04:21
🔗
|
soultcer |
Haha but I love suggesting almost relevant software to you |
04:22
🔗
|
chronomex |
yes |
04:23
🔗
|
chronomex |
it's a 85% buzzword-match |
04:23
🔗
|
chronomex |
suggest it! |
07:09
🔗
|
tef_ |
chronomex: btw you can write resource records instead of request/response records for warcs |
07:09
🔗
|
chronomex |
what's the advantage of that? |
07:09
🔗
|
tef_ |
chronomex: turns out I overlooked part of the spec, but really resource (files sans http traffic) is really meant for legacy record conversion and other protocols |
07:10
🔗
|
tef_ |
part of the pain in writing a proxy is that you have to grab the http traffic |
07:10
🔗
|
tef_ |
as most of the software revolves around hiding them/not logging them |
07:10
🔗
|
chronomex |
ah |
07:10
🔗
|
chronomex |
yeah |
07:10
🔗
|
tef_ |
couldn't find a nice way to hack requests for instance |
07:11
🔗
|
tef_ |
https://github.com/tef/warctools/tree/master/hanzo/httptools |
07:11
🔗
|
tef_ |
sooo I wrote my own http parser in python |
07:11
🔗
|
tef_ |
which can be used for both flat files and sockets |
07:13
🔗
|
tef_ |
chronomex: but yeah you may have more luck managing it that way |
07:14
🔗
|
tef_ |
I did look at the proxy rewriting protocol myself - can't remember why I avoided it, seemed like too much work |
07:15
🔗
|
tef_ |
chronomex: your other option is to try hacking it into webkit |
07:15
🔗
|
chronomex |
noooooo |
07:15
🔗
|
tef_ |
cf. http://browsertoolkit.com/fault-tolerance.png |
07:16
🔗
|
chronomex |
hahaha |
07:16
🔗
|
chronomex |
YUP |
07:16
🔗
|
chronomex |
I don't quite see the problem with a simple proxy that stuffs every request/response to disk |
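A minimal sketch of the "stuff every request/response to disk" step, assuming the proxy already holds the raw bytes of both messages. The function names `warc_record` and `append_exchange` are hypothetical. The records carry only a basic set of WARC/1.0 headers; linking the pair with `WARC-Concurrent-To` and compressing each record are left out.

```python
import uuid
from datetime import datetime, timezone


def warc_record(record_type, target_uri, payload, content_type):
    """Serialize one WARC/1.0 record: header lines, blank line, block, two CRLFs."""
    headers = [
        "WARC/1.0",
        "WARC-Type: " + record_type,
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Target-URI: " + target_uri,
        "Content-Type: " + content_type,
        "Content-Length: %d" % len(payload),
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def append_exchange(path, uri, raw_request, raw_response):
    """Append one request record and one response record to the file at path."""
    with open(path, "ab") as out:
        out.write(warc_record("request", uri, raw_request,
                              "application/http; msgtype=request"))
        out.write(warc_record("response", uri, raw_response,
                              "application/http; msgtype=response"))
```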
07:16
🔗
|
chronomex |
but maybe I'm just stupid and like to think simplistically |
07:17
🔗
|
tef_ |
there isn't one |
07:17
🔗
|
tef_ |
the only problem is that almost all proxies are not designed to log the entire traffic |
07:17
🔗
|
chronomex |
right |
07:17
🔗
|
tef_ |
maybe you could write an ettercap plugin :v |
07:17
🔗
|
chronomex |
why not just write a proxy? |
07:17
🔗
|
tef_ |
you can |
07:17
🔗
|
tef_ |
it's relatively straightforward, except for http parsing |
07:18
🔗
|
tef_ |
oh and doing a man in the middle attack on ssl :-) |
07:18
🔗
|
chronomex |
heh |
07:18
🔗
|
chronomex |
ssl is out of scope for this project |
07:18
🔗
|
tef_ |
oh |
07:18
🔗
|
tef_ |
what about pipelining :-) |
07:19
🔗
|
chronomex |
heh |
07:19
🔗
|
* |
chronomex shrugs |
07:19
🔗
|
tef_ |
but yeah, if you want to go down that route, I can help somewhat |
07:19
🔗
|
tef_ |
https://github.com/tef/warctools/blob/master/hanzo/httptools/messaging.py |
07:20
🔗
|
tef_ |
You might have fun rewriting GET http://blah/foo into GET /foo \r\n Host: blah |
07:21
🔗
|
SmileyG |
you're blowing my minds. |
07:21
🔗
|
tef_ |
chronomex: it supports reading from text and from fds :3 |
07:21
🔗
|
chronomex |
:O |
07:24
🔗
|
tef_ |
https://github.com/tef/warctools/blob/master/hanzo/httptools/messaging.py#L204 |
07:24
🔗
|
tef_ |
this is useful for writing warcs without gzip or chunked encoding :-) |
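For reference, removing chunked transfer-encoding and gzip content-encoding from a captured body could look roughly like this. It is a standalone sketch, not the code at the linked line of messaging.py; `decode_chunked` and `plain_payload` are hypothetical names, and chunk extensions and trailer headers are ignored.

```python
import gzip


def decode_chunked(body):
    """Decode a Transfer-Encoding: chunked body into the plain payload bytes."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal; anything after ';' is a chunk extension.
        size = int(body[pos:line_end].split(b";", 1)[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; trailer headers, if any, are not parsed here
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and the CRLF that follows it
    return bytes(out)


def plain_payload(body, transfer_encoding="", content_encoding=""):
    """Undo chunking first (transfer layer), then gzip (content layer)."""
    if "chunked" in transfer_encoding.lower():
        body = decode_chunked(body)
    if content_encoding.strip().lower() in ("gzip", "x-gzip"):
        body = gzip.decompress(body)
    return body
```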
07:24
🔗
|
chronomex |
I'm not sure if that's appropriate |
07:24
🔗
|
chronomex |
isn't warc supposed to be about preserving the content on the wire? |
07:25
🔗
|
tef_ |
it's ok for a proxy to remove them |
07:25
🔗
|
chronomex |
or is it more about preserving the meaningful content |
07:25
🔗
|
tef_ |
proxies are allowed to vary transfer-encoding, and content-encoding too iirc |
07:25
🔗
|
chronomex |
sure |
07:25
🔗
|
chronomex |
you're talking about a proxy |
07:25
🔗
|
chronomex |
I'm talking about archiving |
07:25
🔗
|
tef_ |
if you like you can write warcs as is, sure |
07:25
🔗
|
chronomex |
why does nobody on the internet understand me today |
07:26
🔗
|
tef_ |
we had legacy issues preventing doing it like so |
07:26
🔗
|
tef_ |
it's also nice for canonicalizing |
07:26
🔗
|
tef_ |
Technically you *should* mutate the headers too, as proxies have to munge incoming requests somewhat |
07:26
🔗
|
chronomex |
right |
07:26
🔗
|
tef_ |
as it turns out not all webservers like full URLs in the method line |
07:27
🔗
|
tef_ |
but your web browser will be sending GET http://... to the proxy |
07:27
🔗
|
chronomex |
that's not surprising |
07:27
🔗
|
chronomex |
well, yeah |
07:27
🔗
|
chronomex |
and have to add a Host: header maybe |
07:27
🔗
|
tef_ |
luckily most seem to add a Host: header too, so you only need to strip the first line |
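The request-line rewrite discussed here, from `GET http://blah/foo` to `GET /foo` plus a `Host:` header, can be sketched as follows. `to_origin_form` is a hypothetical helper that works on the raw request bytes, assumes an ASCII request target, and does not handle CONNECT.

```python
from urllib.parse import urlsplit


def to_origin_form(raw_request):
    """Rewrite 'GET http://host/path HTTP/1.1' to 'GET /path HTTP/1.1'.

    A Host header is added only when the client did not send one.
    """
    head, sep, body = raw_request.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    method, target, version = lines[0].split(b" ", 2)
    parts = urlsplit(target.decode("ascii"))
    if not parts.netloc:
        return raw_request  # target is already a bare path; nothing to rewrite
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines[0] = b" ".join([method, path.encode("ascii"), version])
    has_host = any(line.lower().startswith(b"host:") for line in lines[1:])
    if not has_host:
        lines.insert(1, b"Host: " + parts.netloc.encode("ascii"))
    return b"\r\n".join(lines) + sep + body
```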
07:28
🔗
|
tef_ |
the most annoying thing for me was the logic of when to pipeline and when to wait for close |
07:29
🔗
|
chronomex |
mmm |
07:29
🔗
|
tef_ |
oh and weird little bugs like people who miss out the reason phrase in the status line, i.e. the 'OK' in '200 OK' |
07:30
🔗
|
tef_ |
but yeah, it's a complete and tested http parsing library :3 |
07:30
🔗
|
chronomex |
thx |
07:30
🔗
|
tef_ |
may/may not help you somewhat |
07:32
🔗
|
tef_ |
req = httptools.RequestMessage(), req.feed(....), req.close(), |
07:32
🔗
|
tef_ |
resp = httptools.ResponseMessage(req), resp.feed(...), resp.close(), |
07:34
🔗
|
tef_ |
and things like .complete(), headers_complete(), |
07:34
🔗
|
tef_ |
it's not brilliant :v it is still bad for using bytearray for a buffer aaand using append() |
07:35
🔗
|
tef_ |
btw you need the req to build a response, because for HEAD requests there will be no body to the response |
07:35
🔗
|
tef_ |
*shakes fist at http* |
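Why the response parser needs the request, as a small sketch: whether a response has a body depends on the request method as well as the status code. `response_has_body` is a hypothetical name and is not part of httptools.

```python
def response_has_body(request_method, status_code):
    """Return False when HTTP defines the response as having no message body."""
    if request_method.upper() == "HEAD":
        # A HEAD response may carry Content-Length, but never a body.
        return False
    if 100 <= status_code < 200 or status_code in (204, 304):
        return False
    return True
```

Without this check a parser would sit waiting for `Content-Length` bytes that never arrive after a HEAD request.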
07:36
🔗
|
chronomex |
arrr |
07:38
🔗
|
tef_ |
i'm annoyed I can't officially commit this to warctools due to politics |
07:38
🔗
|
chronomex |
:( |
07:38
🔗
|
tef_ |
I can on the other hand make it reaaaly easy |
07:38
🔗
|
chronomex |
anyway, enough tv, bedtime now |
07:38
🔗
|
tef_ |
night! |
09:31
🔗
|
tef_ |
alard: I see you've made changes to warctools - warctozip.py - would you have anything against them being committed to the hanzo repository ? |
09:32
🔗
|
tef_ |
alard: assuming you're happy with MIT licensing |
09:41
🔗
|
alard |
tef_: Which changes? |
09:42
🔗
|
tef_ |
https://github.com/alard/warctozip |
09:44
🔗
|
tef_ |
alard: warctozip.py basically |
09:44
🔗
|
tef_ |
unless you've got other fixes or improvements |
09:44
🔗
|
alard |
Ah, okay. No, I don't remember any other changes. |
09:45
🔗
|
tef_ |
(my hands are tied on implementing some changes, but i'm free to import anyone else's) |
09:46
🔗
|
tef_ |
so I might add warctozip.py to it, and maybe make it behave like the others with the same options |
09:46
🔗
|
alard |
Well, that's not the best version of the script. There are a few useful improvements in https://github.com/alard/warctozip-service/ |
09:46
🔗
|
alard |
Things like timestamps, better url-to-filename conversion. |
09:47
🔗
|
alard |
I could add those things to warctozip.py and send you the improved version later. |
09:47
🔗
|
tef_ |
awesome |
09:50
🔗
|
tef_ |
could probably pull out a function and stick it in warc.py write_zip(record, zip_file) |
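One possible shape for such a `write_zip(record, zip_file)` function, with the timestamp and url-to-filename ideas folded in. This is a guess at the interface, not the warctozip code: `Record` is a stand-in type invented here, and `url_to_filename` is a deliberately naive mapping that drops query strings, so distinct URLs can collide.

```python
import zipfile
from collections import namedtuple
from urllib.parse import urlsplit

# Hypothetical stand-in for a parsed WARC response record.
# date is a (year, month, day, hour, minute, second) tuple.
Record = namedtuple("Record", "url date payload")


def url_to_filename(url):
    """Map a URL to an archive path, using index.html for directory URLs."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return parts.netloc + path


def write_zip(record, zip_file):
    """Store the record payload in an open ZipFile, keeping the capture time."""
    info = zipfile.ZipInfo(url_to_filename(record.url), date_time=record.date)
    zip_file.writestr(info, record.payload)
```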
09:56
🔗
|
alard |
I'll have a go at combining the scripts later. |
10:20
🔗
|
godane |
i got a star wars last dinner picture from underground gamer |
21:08
🔗
|
SketchCow |
I have been given 25% more pipeline for submitting magazines |