#archiveteam 2012-09-21,Fri


Time Nickname Message
00:46 🔗 SketchCow SAN FRANCISCO, Sept. 20, 2012 -- Salon Media Group (SLNM.PK) and The Well Group,
00:46 🔗 SketchCow Inc. today jointly announced that The WELL is now under the ownership of The Well Group,
00:46 🔗 SketchCow Inc., a private investment group composed of long-time WELL members.
00:46 🔗 SketchCow
00:46 🔗 SketchCow The Well Group, Inc. consists entirely of long-time WELL users with an average tenure
00:46 🔗 SketchCow exceeding 20 years. The purchase marks the first major online business taken private by users of
00:46 🔗 SketchCow the business itself.
00:46 🔗 SketchCow ....... that was unexpected
01:04 🔗 * ivan` just discovers http://www.archiveteam.org/index.php?title=Wget_with_Lua_hooks - very cool
01:10 🔗 ivan` http://largedownloads.ea.com/pub/ might be of interest to someone here; I have no disk space
01:14 🔗 ivan` DFJustin: http://www.youtube.com/user/Tork110/videos has a lot of great game footage as well
02:53 🔗 closure well, that's awesome. A panic grab of the Well was always going to suck mightily
02:55 🔗 closure btw, I grabbed xkcd today. I doubt the comic 1111 = last rumors after yesterday's comic, but better safe than sorry
02:59 🔗 closure hmm, speaking of Salon ... http://www.salon.com/2012/09/20/history_as_recorded_on_twitter_is_vanishing/
03:33 🔗 ersi chronomex: You could just parse through the WARC later, finding all the src="" and do another pass on those URLs
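A minimal sketch of the second-pass idea ersi describes, assuming the HTML payload of a response record has already been pulled out of the WARC as text (reading the WARC itself is not shown). The names `SrcCollector` and `find_src_urls` are made up for illustration. It only sees literal `src` attributes, so anything referenced from CSS or built by scripts would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SrcCollector(HTMLParser):
    """Collect the value of every src attribute seen in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def find_src_urls(html_text, base_url):
    """Return absolute URLs for all src attributes, in document order."""
    collector = SrcCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.urls
```

The returned list could then be fed to a second crawl pass, skipping URLs that are already in the archive.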
03:49 🔗 Lord_Nigh i thought twitter donated their entire backlog of tweets up to 2010 or 2011 to the library of congress
03:49 🔗 Lord_Nigh i guess everything past then may decay though
03:51 🔗 DFJustin article is talking about the external links
03:57 🔗 closure the article isn't sure what it's talking about, but it's positive twitter will always be around & have the data if you know how to dig it out
04:09 🔗 chronomex ersi: that's a stupid patch, and anyway it won't necessarily get the full set of resources needed to render the page
04:16 🔗 ersi chronomex: Patch? What?
04:16 🔗 chronomex you know, an ad-hoc fix
04:16 🔗 ersi ah, heh
04:17 🔗 chronomex besides
04:17 🔗 chronomex I've wanted to put a warc-making proxy behind my main webbrowser for a while
04:17 🔗 chronomex disk is cheap
04:17 🔗 chronomex websites disappear
04:18 🔗 chronomex full-text search of every page you've ever viewed would be awesome
04:18 🔗 soultcer In that case check out YaCy?
04:19 🔗 chronomex not really what I'm interested in
04:19 🔗 chronomex why don't you let me make what I want to make in peace
04:21 🔗 soultcer Haha but I love suggesting almost relevant software to you
04:22 🔗 chronomex yes
04:23 🔗 chronomex it's a 85% buzzword-match
04:23 🔗 chronomex suggest it!
07:09 🔗 tef_ chronomex: btw you can write resource records instead of request/response records for warcs
07:09 🔗 chronomex what's the advantage of that?
07:09 🔗 tef_ chronomex: turns out I overlooked part of the spec, but really resource (files sans http traffic) is really meant for legacy record conversion and other protocols
07:10 🔗 tef_ part of the pain in writing a proxy is that you have to grab the http traffic
07:10 🔗 tef_ as most of the software revolves around hiding them/not logging them
07:10 🔗 chronomex ah
07:10 🔗 chronomex yeah
07:10 🔗 tef_ couldn't find a nice way to hack requests for instance
07:11 🔗 tef_ https://github.com/tef/warctools/tree/master/hanzo/httptools
07:11 🔗 tef_ sooo I wrote my own http parser in python
07:11 🔗 tef_ which can be used for both flat files and sockets
07:13 🔗 tef_ chronomex: but yeah you may have more luck managing it that way
07:14 🔗 tef_ I did look at the proxy rewriting protocol myself - can't remember why I avoided it, seemed like too much work
07:15 🔗 tef_ chronomex: your other option is to try hacking it into webkit
07:15 🔗 chronomex noooooo
07:15 🔗 tef_ c.f http://browsertoolkit.com/fault-tolerance.png
07:16 🔗 chronomex hahaha
07:16 🔗 chronomex YUP
07:16 🔗 chronomex I don't quite see the problem with a simple proxy that stuffs every request/response to disk
07:16 🔗 chronomex but maybe I'm just stupid and like to think simplistically
07:17 🔗 tef_ there isn't one
07:17 🔗 tef_ the only problem is that almost all proxies are not designed to log the entire traffic
07:17 🔗 chronomex right
07:17 🔗 tef_ maybe you could write an ettercap plugin :v
07:17 🔗 chronomex why not just write a proxy?
07:17 🔗 tef_ you can
07:17 🔗 tef_ it's relatively straightforward, except for http parsing
07:18 🔗 tef_ oh and doing a man in the middle attack on ssl :-)
07:18 🔗 chronomex heh
07:18 🔗 chronomex ssl is out of scope for this project
07:18 🔗 tef_ oh
07:18 🔗 tef_ what about pipelining :-)
07:19 🔗 chronomex heh
07:19 🔗 * chronomex shrugs
07:19 🔗 tef_ but yeah, if you want to go down that route, I can help somewhat
07:19 🔗 tef_ https://github.com/tef/warctools/blob/master/hanzo/httptools/messaging.py
07:20 🔗 tef_ You might have fun rewriting GET http://blah/foo into GET /foo \r\n Host: blah
07:21 🔗 SmileyG your blowing my minds.
07:21 🔗 tef_ chronomex: it supports reading from text and from fds :3
07:21 🔗 chronomex :O
07:24 🔗 tef_ https://github.com/tef/warctools/blob/master/hanzo/httptools/messaging.py#L204
07:24 🔗 tef_ this is useful for writing warcs without gzip or chunked encoding :-)
07:24 🔗 chronomex I'm not sure if that's appropriate
07:24 🔗 chronomex isn't warc supposed to be about preserving the content on the wire?
07:25 🔗 tef_ it's ok for a proxy to remove them
07:25 🔗 chronomex or is it more about preserving the meaningful content
07:25 🔗 tef_ proxies are allowed to vary transfer-encoding, and content-encoding too iirc
07:25 🔗 chronomex sure
07:25 🔗 chronomex you're talking about a proxy
07:25 🔗 chronomex I'm talking about archiving
07:25 🔗 tef_ if you like you can write warcs as is, sure
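For reference, removing chunked transfer-encoding before writing a record could look roughly like the sketch below. It is not the warctools code linked above; `dechunk` is a hypothetical helper that assumes the whole chunked body is already in memory, and it ignores any trailer headers after the last chunk.

```python
def dechunk(body: bytes) -> bytes:
    """Decode a complete chunked message body; trailers are ignored."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # Chunk size is hexadecimal, optionally followed by ";extension".
        size = int(body[pos:line_end].split(b";", 1)[0], 16)
        pos = line_end + 2
        if size == 0:
            break  # the zero-length chunk marks the end of the body
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)
```

Whether to store the decoded body or the bytes exactly as they came off the wire is the trade-off being argued here; this only illustrates the decoding step.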
07:25 🔗 chronomex why does nobody on the internet understand me today
07:26 🔗 tef_ we had legacy issues preventing doing it like so
07:26 🔗 tef_ it's also nice for canonicalizing
07:26 🔗 tef_ Technically you *should* mutate the headers too, as proxies have to munge incoming requests somewhat
07:26 🔗 chronomex right
07:26 🔗 tef_ as it turns out not all webservers like full URLs in the method line
07:27 🔗 tef_ but your web browser will be sending GET http://... to the proxy
07:27 🔗 chronomex that's not surprising
07:27 🔗 chronomex well, yeah
07:27 🔗 chronomex and have to add a Host: header maybe
07:27 🔗 tef_ luckily most seem to add a Host: header too, so you only need to strip the first line
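The rewrite tef_ describes (GET http://blah/foo into GET /foo plus Host: blah) can be sketched as follows. This is an illustration under assumptions, not the warctools implementation: `rewrite_proxy_request` is a made-up name, and it expects the complete request head (request line plus headers) as bytes.

```python
from urllib.parse import urlsplit


def rewrite_proxy_request(head: bytes) -> bytes:
    """Turn an absolute-form request head into origin-form.

    Adds a Host header only when the client did not send one.
    """
    text = head.decode("latin-1")
    request_line, _, rest = text.partition("\r\n")
    method, target, version = request_line.split(" ", 2)
    parts = urlsplit(target)
    if not parts.netloc:
        return head  # already origin-form, nothing to strip
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    has_host = any(
        line.lower().startswith("host:") for line in rest.split("\r\n")
    )
    host_line = "" if has_host else f"Host: {parts.netloc}\r\n"
    return f"{method} {path} {version}\r\n{host_line}{rest}".encode("latin-1")
```

For example, `rewrite_proxy_request(b"GET http://blah/foo HTTP/1.1\r\n\r\n")` yields a head starting with `GET /foo HTTP/1.1` followed by `Host: blah`.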
07:28 🔗 tef_ the most annoying thing for me was the logic of when to pipeline and when to wait for close
07:29 🔗 chronomex mmm
07:29 🔗 tef_ oh and weird little bugs like people who miss out phrases in response headers, i.e '200 OK'
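A tolerant status-line parser for the missing-phrase case might look like this. It is a sketch with a hypothetical name (`parse_status_line`), not the httptools code; it treats the reason phrase as optional and returns an empty string when a server leaves it out.

```python
def parse_status_line(line: bytes):
    """Split 'HTTP/1.1 200 OK' into (version, code, reason).

    The reason phrase is optional: 'HTTP/1.1 200' gives reason ''.
    """
    parts = line.rstrip(b"\r\n").split(b" ", 2)
    if len(parts) < 2:
        raise ValueError("not a status line: %r" % line)
    version = parts[0].decode("latin-1")
    code = int(parts[1])
    reason = parts[2].decode("latin-1") if len(parts) == 3 else ""
    return version, code, reason
```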
07:30 🔗 tef_ but yeah, it's a complete and tested http parsing library :3
07:30 🔗 chronomex thx
07:30 🔗 tef_ may/may not help you somewhat
07:32 🔗 tef_ req = httptools.RequestMessage(), req.feed(....), req.close(),
07:32 🔗 tef_ resp = httptools.ResponseMessage(req), resp.feed(...), resp.close(),
07:34 🔗 tef_ and things like .complete(), headers_complete(),
07:34 🔗 tef_ it's not brilliant :v it is still bad for using bytearray for a buffer aaand using append()
07:35 🔗 tef_ btw you need the req to build a response, because for HEAD requests there will be no body to the response
07:35 🔗 tef_ *shakes fist at http*
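The reason the response parser needs the request can be shown with a small check. The function name `response_has_body` is made up for this sketch; the rules are the standard HTTP/1.1 ones (no body for responses to HEAD, for 1xx, 204 and 304). It leaves out the CONNECT case for brevity.

```python
def response_has_body(request_method: str, status_code: int) -> bool:
    """Return False when HTTP says the response carries no message body."""
    if request_method.upper() == "HEAD":
        # Headers may still include Content-Length, but no body follows.
        return False
    if 100 <= status_code < 200 or status_code in (204, 304):
        return False
    return True
```

A parser that only looked at the response headers would wait for Content-Length bytes that never arrive after a HEAD request.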
07:36 🔗 chronomex arrr
07:38 🔗 tef_ i'm annoyed I can't officially commit this to warctools due to politics
07:38 🔗 chronomex :(
07:38 🔗 tef_ I can on the other hand make it reaaaly easy
07:38 🔗 chronomex anyway, enough tv, bedtime now
07:38 🔗 tef_ night!
09:31 🔗 tef_ alard: I see you've made changes to warctools - warctozip.py - would you have anything against them being committed to the hanzo repository?
09:32 🔗 tef_ alard: assuming you're happy with MIT licensing
09:41 🔗 alard tef_: Which changes?
09:42 🔗 tef_ https://github.com/alard/warctozip
09:44 🔗 tef_ alard: warctozip.py basically
09:44 🔗 tef_ unless you've got other fixes or improvements
09:44 🔗 alard Ah, okay. No, I don't remember any other changes.
09:45 🔗 tef_ (my hands are tied on implementing some changes, but i'm free to import anyone elses)
09:46 🔗 tef_ so I might add warctozip.py to it, and maybe make it behave like the others with the same options
09:46 🔗 alard Well, that's not the best version of the script. There are a few useful improvements in https://github.com/alard/warctozip-service/
09:46 🔗 alard Things like timestamps, better url-to-filename conversion.
09:47 🔗 alard I could add those things to warctozip.py and send you the improved version later.
09:47 🔗 tef_ awesome
09:50 🔗 tef_ could probably pull out a function and stick it in warc.py write_zip(record, zip_file)
09:56 🔗 alard I'll have at combining the scripts later.
10:20 🔗 godane i got a star wars last dinner picture from underground gamer
21:08 🔗 SketchCow I have been given 25% more pipeline for submitting magazines
