#archiveteam 2012-09-21,Fri

Time	Nickname	Message
00:46 ^🔗	SketchCow	SAN FRANCISCO, Sept. 20, 2012 -- Salon Media Group (SLNM.PK) and The Well Group,
00:46 ^🔗	SketchCow	Inc. today jointly announced that The WELL is now under the ownership of The Well Group,
00:46 ^🔗	SketchCow	Inc., a private investment group composed of long-time WELL members.
00:46 ^🔗	SketchCow
00:46 ^🔗	SketchCow	The Well Group, Inc. consists entirely of long-time WELL users with an average tenure
00:46 ^🔗	SketchCow	exceeding 20 years. The purchase marks the first major online business taken private by users of
00:46 ^🔗	SketchCow	the business itself.
00:46 ^🔗	SketchCow	....... that was unexpected
01:04 ^🔗	*	ivan` just discovers http://www.archiveteam.org/index.php?title=Wget_with_Lua_hooks - very cool
01:10 ^🔗	ivan`	http://largedownloads.ea.com/pub/ might be of interest to someone here; I have no disk space
01:14 ^🔗	ivan`	DFJustin: http://www.youtube.com/user/Tork110/videos has a lot of great game footage as well
02:53 ^🔗	closure	well, that's awesome. A panic grab of the Well was always going to suck mightily
02:55 ^🔗	closure	btw, I grabbed xkcd today. I doubt the comic 1111 = last rumors after yesterday's comic, but better safe than sorry
02:59 ^🔗	closure	hmm, speaking of Salon ... http://www.salon.com/2012/09/20/history_as_recorded_on_twitter_is_vanishing/
03:33 ^🔗	ersi	chronomex: You could just parse through the WARC later, finding all the src="" and do another pass on those URLs
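A rough Python sketch of the second pass ersi suggests: pull the src="" references out of an already-extracted HTML payload and resolve them against the page URL. The helper name find_src_urls is hypothetical, reading the payload out of the WARC is assumed to happen elsewhere, and a regex like this only sees literal src attributes, so resources loaded from CSS or scripts are missed.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or src='...' in raw HTML bytes (deliberately simplistic).
SRC_RE = re.compile(rb'''src\s*=\s*["']([^"']+)["']''', re.IGNORECASE)

def find_src_urls(base_url, html_bytes):
    """Return absolute URLs for every src attribute, in order, without duplicates."""
    urls = []
    for match in SRC_RE.finditer(html_bytes):
        ref = match.group(1).decode("utf-8", "replace")
        absolute = urljoin(base_url, ref)  # resolve relative references
        if absolute not in urls:
            urls.append(absolute)
    return urls

if __name__ == "__main__":
    page = b'<img src="pic.png"><script src="/js/x.js"></script>'
    for url in find_src_urls("http://example.com/a/page.html", page):
        print(url)
```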
03:49 ^🔗	Lord_Nigh	i thought twitter donated their entire backlog of tweets up to 2010 or 2011 to the library of congress
03:49 ^🔗	Lord_Nigh	i guess everything past then may decay though
03:51 ^🔗	DFJustin	article is talking about the external links
03:57 ^🔗	closure	the article isn't sure what it's talking about, but it's positive twitter will always be around & have the data if you know how to dig it out
04:09 ^🔗	chronomex	ersi: that's a stupid patch, and anyway it won't necessarily get the full set of resources needed to render the page
04:16 ^🔗	ersi	chronomex: Patch? What?
04:16 ^🔗	chronomex	you know, an ad-hoc fix
04:16 ^🔗	ersi	ah, heh
04:17 ^🔗	chronomex	besides
04:17 ^🔗	chronomex	I've wanted to put a warc-making proxy behind my main webbrowser for a while
04:17 ^🔗	chronomex	disk is cheap
04:17 ^🔗	chronomex	websites disappear
04:18 ^🔗	chronomex	full-text search of every page you've ever viewed would be awesome
04:18 ^🔗	soultcer	In that case check out YaCy?
04:19 ^🔗	chronomex	not really what I'm interested in
04:19 ^🔗	chronomex	why don't you let me make what I want to make in peace
04:21 ^🔗	soultcer	Haha but I love suggesting almost relevant software to you
04:22 ^🔗	chronomex	yes
04:23 ^🔗	chronomex	it's a 85% buzzword-match
04:23 ^🔗	chronomex	suggest it!
07:09 ^🔗	tef_	chronomex: btw you can write resource records instead of request/response records for warcs
07:09 ^🔗	chronomex	what's the advantage of that?
07:09 ^🔗	tef_	chronomex: turns out I overlooked part of the spec, but really resource (files sans http traffic) is really meant for legacy record conversion and other protocols
07:10 ^🔗	tef_	part of the pain in writing a proxy is that you have to grab the http traffic
07:10 ^🔗	tef_	as most of the software revolves around hiding them/not logging them
07:10 ^🔗	chronomex	ah
07:10 ^🔗	chronomex	yeah
07:10 ^🔗	tef_	couldn't find a nice way to hack requests for instance
07:11 ^🔗	tef_	https://github.com/tef/warctools/tree/master/hanzo/httptools
07:11 ^🔗	tef_	sooo I wrote my own http parser in python
07:11 ^🔗	tef_	which can be used for both flat files and sockets
07:13 ^🔗	tef_	chronomex: but yeah you may have more luck managing it that way
07:14 ^🔗	tef_	I did look at the proxy rewriting protocol myself - can't remember why I avoided it, seemed like too much work
07:15 ^🔗	tef_	chronomex: your other option is to try hacking it into webkit
07:15 ^🔗	chronomex	noooooo
07:15 ^🔗	tef_	c.f http://browsertoolkit.com/fault-tolerance.png
07:16 ^🔗	chronomex	hahaha
07:16 ^🔗	chronomex	YUP
07:16 ^🔗	chronomex	I don't quite see the problem with a simple proxy that stuffs every request/response to disk
07:16 ^🔗	chronomex	but maybe I'm just stupid and like to think simplistically
07:17 ^🔗	tef_	there isn't one
07:17 ^🔗	tef_	the only problem is that almost all proxies are not designed to log the entire traffic
07:17 ^🔗	chronomex	right
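For the "stuff every request/response to disk" idea, a minimal Python sketch of serializing one captured HTTP message as a WARC 1.0 record. The function name warc_record and its parameters are hypothetical; this is not the warctools API, it writes only a handful of headers, and it leaves out extras such as digests and gzip compression.

```python
import uuid
from datetime import datetime, timezone

def warc_record(record_type, target_uri, payload, content_type):
    """Serialize raw bytes (e.g. a full HTTP response) as one WARC 1.0 record."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # A record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"

if __name__ == "__main__":
    raw_response = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi"
    record = warc_record("response", "http://example.com/", raw_response,
                         "application/http; msgtype=response")
    with open("capture.warc", "ab") as out:  # hypothetical output file
        out.write(record)
```

A request would be written the same way with record_type "request" and content type "application/http; msgtype=request".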
07:17 ^🔗	tef_	maybe you could write an ettercap plugin :v
07:17 ^🔗	chronomex	why not just write a proxy?
07:17 ^🔗	tef_	you can
07:17 ^🔗	tef_	it's relatively straightforward, except for http parsing
07:18 ^🔗	tef_	oh and doing a man in the middle attack on ssl :-)
07:18 ^🔗	chronomex	heh
07:18 ^🔗	chronomex	ssl is out of scope for this project
07:18 ^🔗	tef_	oh
07:18 ^🔗	tef_	what about pipelining :-)
07:19 ^🔗	chronomex	heh
07:19 ^🔗	*	chronomex shrugs
07:19 ^🔗	tef_	but yeah, if you want to go down that route, I can help somewhat
07:19 ^🔗	tef_	https://github.com/tef/warctools/blob/master/hanzo/httptools/messaging.py
07:20 ^🔗	tef_	You might have fun rewriting GET http://blah/foo into GET /foo \r\n Host: blah
07:21 ^🔗	SmileyG	you're blowing my minds.
07:21 ^🔗	tef_	chronomex: it supports reading from text and from fds :3
07:21 ^🔗	chronomex	:O
07:24 ^🔗	tef_	https://github.com/tef/warctools/blob/master/hanzo/httptools/messaging.py#L204
07:24 ^🔗	tef_	this is useful for writing warcs without gzip or chunked encoding :-)
07:24 ^🔗	chronomex	I'm not sure if that's appropriate
07:24 ^🔗	chronomex	isn't warc supposed to be about preserving the content on the wire?
07:25 ^🔗	tef_	it's ok for a proxy to remove them
07:25 ^🔗	chronomex	or is it more about preserving the meaningful content
07:25 ^🔗	tef_	proxies are allowed to vary transfer-encoding, and content-encoding too iirc
07:25 ^🔗	chronomex	sure
07:25 ^🔗	chronomex	you're talking about a proxy
07:25 ^🔗	chronomex	I'm talking about archiving
07:25 ^🔗	tef_	if you like you can write warcs as is, sure
07:25 ^🔗	chronomex	why does nobody on the internet understand me today
07:26 ^🔗	tef_	we had legacy issues preventing doing it like so
07:26 ^🔗	tef_	it's also nice for canonicalizing
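To make the "without chunked encoding" option concrete, here is a small Python sketch of undoing the chunked transfer-coding on a body that is already fully in memory. dechunk is a hypothetical helper, not the function tef_ links to in messaging.py; it ignores chunk extensions and trailers and assumes well-formed input.

```python
def dechunk(data: bytes) -> bytes:
    """Decode a complete chunked-transfer-coded body into the plain body bytes."""
    out = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)           # end of the chunk-size line
        size_field = data[pos:eol].split(b";", 1)[0].strip()
        size = int(size_field, 16)               # chunk size is hexadecimal
        pos = eol + 2
        if size == 0:                            # last-chunk marker
            break
        out += data[pos:pos + size]
        pos += size + 2                          # skip chunk data and its CRLF
    return bytes(out)

if __name__ == "__main__":
    print(dechunk(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"))
```

Whether to store the decoded body or the bytes exactly as they came off the wire is the archiving question debated above; the sketch only shows what the decoding step involves.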
07:26 ^🔗	tef_	Technically you should mutate the headers too, as proxies have to munge incoming requests somewhat
07:26 ^🔗	chronomex	right
07:26 ^🔗	tef_	as it turns out not all webservers like full URLs in the method line
07:27 ^🔗	tef_	but your web browser will be sending GET http://... to the proxy
07:27 ^🔗	chronomex	that's not surprising
07:27 ^🔗	chronomex	well, yeah
07:27 ^🔗	chronomex	and have to add a Host: header maybe
07:27 ^🔗	tef_	luckily most seem to add a Host: header too, so you only need to strip the first line
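The request-line rewrite discussed here (GET http://blah/foo becoming GET /foo with a Host: blah header) could be sketched in Python roughly as follows. to_origin_form is a hypothetical helper, not part of warctools; it assumes the whole request is in memory, handles only http/https targets, and leaves anything else (such as CONNECT) untouched.

```python
from urllib.parse import urlsplit

def to_origin_form(raw_request: bytes) -> bytes:
    """Rewrite 'GET http://host/path' to 'GET /path', adding Host: if missing."""
    head, sep, body = raw_request.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    method, target, version = lines[0].split(b" ", 2)
    parts = urlsplit(target.decode("ascii"))
    if parts.scheme not in ("http", "https"):
        return raw_request                      # already origin-form, or not ours
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines[0] = b" ".join([method, path.encode("ascii"), version])
    # Most browsers already send Host:, so only add it when it is absent.
    if not any(line.lower().startswith(b"host:") for line in lines[1:]):
        lines.insert(1, b"Host: " + parts.netloc.encode("ascii"))
    return b"\r\n".join(lines) + sep + body

if __name__ == "__main__":
    print(to_origin_form(b"GET http://blah/foo HTTP/1.1\r\n\r\n"))
```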
07:28 ^🔗	tef_	the most annoying thing for me was the logic of when to pipeline and when to wait for close
07:29 ^🔗	chronomex	mmm
07:29 ^🔗	tef_	oh and weird little bugs like people who miss out phrases in response headers, i.e '200 OK'
07:30 ^🔗	tef_	but yeah, it's a complete and tested http parsing library :3
07:30 ^🔗	chronomex	thx
07:30 ^🔗	tef_	may/may not help you somewhat
07:32 ^🔗	tef_	req = httptools.RequestMessage(), req.feed(....), req.close(),
07:32 ^🔗	tef_	resp = httptools.ResponseMessage(req), resp.feed(...), resp.close(),
07:34 ^🔗	tef_	and things like .complete(), headers_complete(),
07:34 ^🔗	tef_	it's not brilliant :v it is still bad for using bytearray for a buffer aaand using append()
07:35 ^🔗	tef_	btw you need the req to build a response, because for HEAD requests there will be no body to the response
07:35 ^🔗	tef_	shakes fist at http
07:36 ^🔗	chronomex	arrr
07:38 ^🔗	tef_	i'm annoyed I can't officially commit this to warctools due to politics
07:38 ^🔗	chronomex	:(
07:38 ^🔗	tef_	I can on the other hand make it reaaaly easy
07:38 ^🔗	chronomex	anyway, enough tv, bedtime now
07:38 ^🔗	tef_	night!
09:31 ^🔗	tef_	alard: I see you've made changes to warctools - warctozip.py - would you have anything against them being committed to the hanzo repository ?
09:32 ^🔗	tef_	alard: assuming you're happy with MIT licensing
09:41 ^🔗	alard	tef_: Which changes?
09:42 ^🔗	tef_	https://github.com/alard/warctozip
09:44 ^🔗	tef_	alard: warctozip.py basically
09:44 ^🔗	tef_	unless you've got other fixes or improvements
09:44 ^🔗	alard	Ah, okay. No, I don't remember any other changes.
09:45 ^🔗	tef_	(my hands are tied on implementing some changes, but i'm free to import anyone else's)
09:46 ^🔗	tef_	so I might add warctozip.py to it, and maybe make it behave like the others with the same options
09:46 ^🔗	alard	Well, that's not the best version of the script. There are a few useful improvements in https://github.com/alard/warctozip-service/
09:46 ^🔗	alard	Things like timestamps, better url-to-filename conversion.
09:47 ^🔗	alard	I could add those things to warctozip.py and send you the improved version later.
09:47 ^🔗	tef_	awesome
09:50 ^🔗	tef_	could probably pull out a function and stick it in warc.py write_zip(record, zip_file)
09:56 ^🔗	alard	I'll have a go at combining the scripts later.
10:20 ^🔗	godane	i got a star wars last dinner picture from underground gamer
21:08 ^🔗	SketchCow	I have been given 25% more pipeline for submitting magazines
