#archiveteam 2012-01-06,Fri


Time	Nickname	Message
00:52 ^🔗	yipdw	oh nice
00:52 ^🔗	yipdw	http://archive-access.sourceforge.net/projects/wayback/hadoop.html
00:52 ^🔗	yipdw	it looks like that's the WARC ingester/browser I've been looking for
00:53 ^🔗	*	yipdw gives it a shot
00:56 ^🔗	chronomex	don't shoot!
00:58 ^🔗	underscor	<Frooxius> It's one of my catchphrases
00:58 ^🔗	underscor	<Frooxius> it's missing a tryphrase though
00:58 ^🔗	yipdw	that was lamebda
00:58 ^🔗	*	underscor cries
00:59 ^🔗	yipdw	i can't come up with a continuation to that though
01:07 ^🔗	yipdw	ugh
01:07 ^🔗	yipdw	maybe I'm just kinda sleepy and/or stupid but I cannot figure out how to upload a WARC file to wayback
01:08 ^🔗	yipdw	the administrator manual is quite dense too
01:09 ^🔗	yipdw	oh, ugh, I have to write XML
01:10 ^🔗	underscor	yipdw: I find it easier to just extract the contents with warctool
01:10 ^🔗	yipdw	underscor: yeah, but I can't easily see what I got that way
01:10 ^🔗	yipdw	flat files are a terrible way to check hypertext
01:10 ^🔗	underscor	You can see which files, but yeah
01:10 ^🔗	yipdw	https://github.com/openplanets/wap/tree/master/ArchiveExplorer looks like it's what I want
01:11 ^🔗	yipdw	...sort of
01:14 ^🔗	yipdw	ha, damn, proust uses a lot of separate image files
01:15 ^🔗	yipdw	I'm surprised that they didn't use one of the many many solutions out there for CSS spriting
01:15 ^🔗	yipdw	oh wait no
01:16 ^🔗	yipdw	this ArchiveExplorer is garbage
01:16 ^🔗	yipdw	ok, this is what I want
01:16 ^🔗	yipdw	I want a tool that takes as input the path to a *.warc.gz file
01:16 ^🔗	yipdw	starts up a webserver
01:16 ^🔗	yipdw	and then mounts the warc.gz at /
01:16 ^🔗	yipdw	you can then access the started webserver via any web browser.
01:16 ^🔗	yipdw	how hard can it be?
01:18 ^🔗	yipdw	let's find out
01:19 ^🔗	yipdw	time to yak shave
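
The tool yipdw describes (take a *.warc.gz path, start a webserver, serve the archive at /) could be sketched roughly as below. This is a minimal illustration using only the Python standard library, not anything built in the channel. The names `read_records`, `build_index`, `make_handler` and `serve` are hypothetical, and the sketch assumes all hosts can be merged under one root and that response bodies are stored without chunked or compressed transfer encoding.

```python
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit


def read_records(stream):
    """Yield (headers, block) for each WARC record in an uncompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % line[:20])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers.get("content-length", "0")))
        yield headers, block


def build_index(stream):
    """Map URL path (plus query) to (content type, body) for 'response' records."""
    index = {}
    for headers, block in read_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # The block is a raw HTTP response: status line, headers, blank line, body.
        head, _, body = block.partition(b"\r\n\r\n")
        ctype = "application/octet-stream"
        for field in head.split(b"\r\n")[1:]:
            key, _, value = field.decode("latin-1").partition(":")
            if key.strip().lower() == "content-type":
                ctype = value.strip()
        parts = urlsplit(headers.get("warc-target-uri", "").strip("<>"))
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        index[path] = (ctype, body)  # assumption: host names are ignored
    return index


def make_handler(index):
    """Build a request handler class that answers GETs from the index."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            hit = index.get(self.path)
            if hit is None:
                self.send_error(404, "not in archive")
                return
            ctype, body = hit
            self.send_response(200)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
    return Handler


def serve(warc_gz_path, port=8080):
    """Load the whole archive into memory, then serve it on localhost."""
    with gzip.open(warc_gz_path, "rb") as stream:
        index = build_index(stream)
    HTTPServer(("127.0.0.1", port), make_handler(index)).serve_forever()
```

Calling `serve("some-grab.warc.gz")` (a placeholder filename) and browsing to http://127.0.0.1:8080/ would then show whatever the archive holds for the root path. Holding everything in memory keeps the sketch short but would not suit large grabs.
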
01:21 ^🔗	yipdw	underscor: what's warctool?
01:21 ^🔗	yipdw	I don't see it in the hanzo repo
01:22 ^🔗	yipdw	it just occurred to me that if that can recreate some sort of directory structure, then I can get pretty far by just running thin on that
01:23 ^🔗	chronomex	yipdw: wouldn't you have to decompress the .gz ? I'm pretty sure gzip is not seekable by default
01:23 ^🔗	yipdw	chronomex: not sure of the implementation details yet; haven't actually tried building this
01:23 ^🔗	yipdw	i want to see if I can take the lazy way out by decompressing and extracting contents
01:23 ^🔗	chronomex	probably
01:24 ^🔗	yipdw	but the important part for me is being able to verify the archive in a browser
01:34 ^🔗	yipdw	actually, just running thin on the non-WARC bits seems to work reasonably ok
01:34 ^🔗	yipdw	it doesn't load all images but I can verify that those exist
01:35 ^🔗	underscor	There's something special about warc.gz files, where you can seek them and it just works™
01:35 ^🔗	underscor	But I think that only applies to Heritrix warcs
01:35 ^🔗	underscor	yipdw: http://warc-tools.googlecode.com/svn/trunk/app/python/
01:36 ^🔗	underscor	I think warcdump.py does what you want
01:37 ^🔗	yipdw	oh, I thought that repo was deprecated
01:37 ^🔗	yipdw	well then
01:37 ^🔗	yipdw	I'll give that a try
01:37 ^🔗	yipdw	but first, time to cook stuff
01:37 ^🔗	underscor	it worked for me
01:37 ^🔗	underscor	but that was in july
01:38 ^🔗	underscor	so the wget-warc spec might have changed
02:10 ^🔗	yipdw	mmm, sockeye salmon and quinoa
04:47 ^🔗	SketchCow	Hey, guys.
04:47 ^🔗	SketchCow	Oh my god, what an amazing hotel and location.
04:53 ^🔗	Coderjoe	underscor: that should work for wget-warc files as well. it has the code to put the offset or length or whatever in the gzip extrainfo field
04:59 ^🔗	Coderjoe	it is done in a rather hackish way, too, IMO
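
One common way to make a .warc.gz seekable, which may be what underscor is describing, is to compress every record as its own gzip member and concatenate the members. A reader that knows a record's byte offset can then decompress just that record. The sketch below shows only that idea; `write_members` and `read_member` are hypothetical names, and the gzip extra-field bookkeeping Coderjoe mentions is not modelled.

```python
import gzip
import io
import zlib


def write_members(records):
    """Compress each record as a separate gzip member; return (blob, offsets)."""
    out = io.BytesIO()
    offsets = []
    for record in records:
        offsets.append(out.tell())  # byte offset where this member starts
        out.write(gzip.compress(record))
    return out.getvalue(), offsets


def read_member(blob, offset):
    """Decompress exactly one gzip member that starts at the given offset."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of the member; any following members
    # are left untouched in decomp.unused_data.
    return decomp.decompress(blob[offset:])
```

The concatenated blob is still a valid gzip stream, so ordinary tools that read it front to back see all records in order.
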
05:28 ^🔗	ce	hi everyone
05:28 ^🔗	chronomex	ciao
05:29 ^🔗	ce	happy new year
05:33 ^🔗	yipdw	argh
05:33 ^🔗	yipdw	if anyone can actually build this -> https://github.com/openplanets/wap/tree/master/ArchiveFS
05:33 ^🔗	yipdw	let me know
05:34 ^🔗	yipdw	I hate it when code that is supposedly up-to-date depends on dependencies that are five or six years old
05:34 ^🔗	yipdw	it means that that code is actually not up-to-date
05:34 ^🔗	chronomex	ce: bit late for that, eh? :)
05:46 ^🔗	yipdw	oh god, guys, really
05:46 ^🔗	yipdw	the authors of this Maven file made it download a hundred or so dependencies
05:46 ^🔗	yipdw	everything BUT a Fuse-J build
05:47 ^🔗	yipdw	so they included automatic resolution of all the trivial dependencies
05:55 ^🔗	yipdw	man, the more I look for a way to inspect WARCs, the more I am convinced that nobody actually has a way to do it
05:55 ^🔗	yipdw	this is a bit unsettling
05:57 ^🔗	yipdw	or, more precisely, many people have written their own ways to do it, but those ways are crazy ad-hoc and require a very specific environment
07:01 ^🔗	yipdw	ah!
07:01 ^🔗	yipdw	I got wayback up and running
07:02 ^🔗	yipdw	let's see what happens if I shove some WARCs at it
07:34 ^🔗	yipdw	wholy fuck
07:34 ^🔗	yipdw	it works
07:34 ^🔗	chronomex	o_o
07:34 ^🔗	chronomex	rad
07:35 ^🔗	yipdw	http://depot.ninjawedding.org/wayback.png
07:36 ^🔗	yipdw	this is rad
07:36 ^🔗	yipdw	"Think Proust is neat? "
07:36 ^🔗	yipdw	no
07:36 ^🔗	yipdw	think wayback is neat, yes
07:36 ^🔗	yipdw	at least once you get it working
07:41 ^🔗	yipdw	hmm, that's weird. I have resources in the WARC under proust.com/i/a22/... that are referenced from proust.com/story
07:41 ^🔗	yipdw	but the Wayback Machine can't find them
07:41 ^🔗	yipdw	dunno what that means
07:42 ^🔗	yipdw	oh, they're not actually in the WARC
07:42 ^🔗	yipdw	ok
08:25 ^🔗	yipdw	hmm, here's a stumper re: Javascript and archival of pages
08:25 ^🔗	yipdw	the Wayback Machine's viewer frame includes jQuery 1.3.2
08:25 ^🔗	yipdw	many sites make use of things like $(...).delegate()
08:25 ^🔗	yipdw	which is not supported in jQuery 1.3.2
08:25 ^🔗	yipdw	and actually, Wayback's viewer is not an iframe
08:25 ^🔗	chronomex	HMMMM.
08:25 ^🔗	yipdw	so there's a jQuery conflict
08:26 ^🔗	yipdw	I just noticed this while viewing Proust grabs
08:26 ^🔗	yipdw	not sure how to fix it, aside from filing it as a Wayback Machine bug
08:28 ^🔗	yipdw	also, does it make sense to archive things like http://www.google-analytics.com/ga.js if they're included in a page?
08:29 ^🔗	yipdw	I mean, on the one hand it IS part of the page
08:29 ^🔗	yipdw	but on the other:
08:29 ^🔗	yipdw	(1) Google Analytics is fucking evil
08:29 ^🔗	yipdw	(2) there is no way to block GA served via archive.org
08:29 ^🔗	yipdw	without collateral damage to other scripts served from archive.org (or any Wayback server)
08:30 ^🔗	chronomex	why is GA evil?
08:31 ^🔗	yipdw	my perspective
08:31 ^🔗	chronomex	sure, what makes you think so?
08:31 ^🔗	yipdw	it reminds me too much of Deus Ex's Daedalus
08:32 ^🔗	yipdw	and other such panopticons
08:32 ^🔗	chronomex	I'm not familiar with this Daedalus, care to explain?
08:32 ^🔗	yipdw	Daedalus, in Deus Ex, was a distributed system that reported on all communications to the Illuminati
08:32 ^🔗	chronomex	ah
08:32 ^🔗	yipdw	I never said it was a rational dislike :P
08:33 ^🔗	yipdw	but I think point (2) is the more important one anyway
08:33 ^🔗	chronomex	did I say anything about you being irrational?
08:33 ^🔗	chronomex	it seems like you are insecure in your beliefs
08:33 ^🔗	yipdw	no, just throwing that out there
08:33 ^🔗	chronomex	if you believe it, own it. don't half-believe.
08:33 ^🔗	yipdw	namely, that if someone has GA block rules set up via noscript, then this sort of thing is a runaround
08:34 ^🔗	chronomex	that's true. perhaps you should email info@archive.org.
08:34 ^🔗	chronomex	but consider this
08:34 ^🔗	chronomex	wayback machine scraper saves everything including the HTTP headers. why would they alter the page itself?
08:34 ^🔗	yipdw	well, the scraper doesn't alter the page
08:34 ^🔗	chronomex	correct
08:35 ^🔗	chronomex	I bet you could gin up some sort of noscript rules
08:35 ^🔗	yipdw	I'm going to suck in all external javascripts, too
08:35 ^🔗	chronomex	(I know nothing about noscript)
08:35 ^🔗	yipdw	but
08:35 ^🔗	yipdw	when the page is presented via the Wayback Machine, CSS and Javascript references are rewritten
08:35 ^🔗	chronomex	man, dang, fucking amazon charging me sales tax
08:36 ^🔗	chronomex	it's not like I live in SEATTLE or anything
08:36 ^🔗	Coderjoe	I also do not like GA, simply from the panopticon-like side.
08:36 ^🔗	chronomex	oh wait, I do. I can see their old HQ from my house.
08:36 ^🔗	Coderjoe	you think they just track you on one site? HA. they track you everywhere you go that uses GA
08:36 ^🔗	chronomex	I like GA, and I wish it did more.
08:37 ^🔗	chronomex	though I don't really use it much, I tend to lean harder on getclicky.com
08:37 ^🔗	yipdw	oh, different audiences there
08:37 ^🔗	chronomex	:P
08:37 ^🔗	yipdw	if I had to use GA for, well, analytics, yeah, I'd love it
08:37 ^🔗	yipdw	but that's because in that case I Have The Power
08:37 ^🔗	yipdw	:P
08:38 ^🔗	chronomex	web's quite different when you're on the other side of things
08:38 ^🔗	Coderjoe	why not just use a log analysis tool to get the stats?
08:38 ^🔗	yipdw	yes
08:38 ^🔗	chronomex	changes your perspective rather forcibly
08:38 ^🔗	yipdw	I've abused that power quite often
08:38 ^🔗	yipdw	like changing images on my site that get lots of external references
08:38 ^🔗	chronomex	Coderjoe: will that get you screen resolution or amount of time the user kept the page open before leaving?
08:39 ^🔗	Coderjoe	the guy running rosettacode.org has noticed that GA only shows around 50% or so of his traffic (comparing to log files)
08:39 ^🔗	chronomex	GA samples stochastically when you get above some threshold
08:39 ^🔗	Coderjoe	yipdw: ever given the hello.jpg to those bandwidth thieves?
08:40 ^🔗	yipdw	Coderjoe: no, but I advise you not to browse animemusicvideos.org/forum at work anymore
08:40 ^🔗	chronomex	hahahaha
08:40 ^🔗	yipdw	there are several replacements of facepalm-256.gif with penises
08:40 ^🔗	Coderjoe	O_o
08:40 ^🔗	chronomex	excellent
08:40 ^🔗	Coderjoe	yay
08:41 ^🔗	chronomex	world needs more penises
08:41 ^🔗	chronomex	3.5 billion isn't enough
08:44 ^🔗	chronomex	not by a long shot
08:44 ^🔗	yipdw	there must be more than 3.5 billion penises in the world
08:44 ^🔗	chronomex	it's about the right order of magnitude
08:45 ^🔗	yipdw	what about all those insects
08:46 ^🔗	chronomex	oh, I was just counting humans
08:46 ^🔗	yipdw	but, uh
08:46 ^🔗	yipdw	dicks aside
08:47 ^🔗	yipdw	wayback rewrites script src URLs from e.g. http://platform.twitter.com/widgets.js to something like http://192.168.122.246:8080/20120106083222js_/http://platform.twitter.com/widgets.js
08:47 ^🔗	yipdw	makes it tricky
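
The rewriting in yipdw's example can be read as prefix + 14-digit timestamp + optional flag + "/" + original URL. A small sketch of mapping in both directions follows. The pattern is inferred from that single example rather than from Wayback's source, and `to_replay_url` and `from_replay_url` are hypothetical helper names.

```python
import re

# 14-digit timestamp, optional lowercase flag ending in "_" (e.g. "js_"),
# then the original URL. Inferred from one example URL in the log.
REPLAY = re.compile(r"^(?P<ts>\d{14})(?P<flag>[a-z]+_)?/(?P<url>.+)$")


def to_replay_url(prefix, timestamp, url, flag=""):
    """Build a replay URL from a server prefix, timestamp and original URL."""
    return "%s/%s%s/%s" % (prefix.rstrip("/"), timestamp, flag, url)


def from_replay_url(prefix, replay_url):
    """Return (timestamp, flag, original URL), or None if it does not match."""
    base = prefix.rstrip("/") + "/"
    if not replay_url.startswith(base):
        return None
    match = REPLAY.match(replay_url[len(base):])
    if match is None:
        return None
    return match.group("ts"), match.group("flag") or "", match.group("url")
```

Because the original URL survives intact as the tail of the rewritten one, a block rule would have to match on that tail rather than on the serving host.
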
08:48 ^🔗	yipdw	I'll grab them all; proust seems to break in weird ways if some externals aren't present
11:36 ^🔗	ersi	Mmmh, DOS games
18:43 ^🔗	yipdw	huh, what the fuck
18:43 ^🔗	yipdw	FileUtils#mkdir_p is returning Errno::EEXIST
18:43 ^🔗	yipdw	that...should not happen
18:44 ^🔗	yipdw	oH!
18:44 ^🔗	yipdw	because it's a dead symlink
18:44 ^🔗	yipdw	oops
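
The same surprise can be reproduced outside Ruby. The sketch below is a Python analogue, not yipdw's code: `os.makedirs(..., exist_ok=True)` also fails on a dangling symlink, because the link itself occupies the name while its target is not a directory. The helper names and the "cache" and "gone" path components are made up for illustration.

```python
import os
import tempfile


def describe(path):
    """Report how the path looks with and without following symlinks."""
    return {
        "exists": os.path.exists(path),    # follows the link
        "lexists": os.path.lexists(path),  # looks at the link itself
        "islink": os.path.islink(path),
    }


def try_makedirs(path):
    """Attempt an 'mkdir -p' style create and report what happened."""
    try:
        os.makedirs(path, exist_ok=True)
        return "created"
    except FileExistsError:
        return "FileExistsError"


def demo():
    """Create a dangling symlink in a temp dir and try to mkdir over it."""
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, "cache")
        os.symlink(os.path.join(tmp, "gone"), link)  # target never created
        return describe(link), try_makedirs(link)
```

Checking `os.path.islink` (or `File.symlink?` in Ruby) before creating the directory is one way to spot this case early.
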
19:02 ^🔗	yipdw	also, if anyone here is running a Redis instance, I'd appreciate help running the discovery scripts in https://github.com/ArchiveTeam/proust-pulling
19:02 ^🔗	yipdw	it seems the best shot for converging on a full set of users is to just run those over and over
19:13 ^🔗	yipdw	wait, never mind, I'm a moron
19:13 ^🔗	yipdw	I jsut ralized proust publishes a sitemap that includes all public stories
19:13 ^🔗	yipdw	and that I cant' type
19:20 ^🔗	Coderjoe	I mentioned the sitemap, but I think it hasn't been updated in awhile
19:20 ^🔗	Coderjoe	there were some new stories on the people page that I don't think were in the sitemap
19:31 ^🔗	QUICKSTRE	FILM STREAMING! http://quickstream.altervista.org/
19:31 ^🔗	nitro2k01	F! F! S!
19:31 ^🔗	yipdw	Coderjoe: yeah, but I can combine it with the existing sources
19:32 ^🔗	Coderjoe	yipdw: yeah, that's what I was expecting would happen. at least pulling people from it.
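
Combining the sitemap with the other discovery sources amounts to a set union, which also absorbs the stale-sitemap problem Coderjoe raises: anything found elsewhere is simply added. A rough sketch, assuming the sitemap follows the standard sitemaps.org XML format; `sitemap_urls` and `merge_sources` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def merge_sources(*sources):
    """Union any number of URL collections into one de-duplicated set."""
    merged = set()
    for source in sources:
        merged.update(source)
    return merged
```
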
19:32 ^🔗	yipdw	proust's robots.txt also disallows /js and /css
19:32 ^🔗	yipdw	I'm debating whether or not to abide by that
19:32 ^🔗	yipdw	standard protocol here is, I think, "fuck it"
19:33 ^🔗	yipdw	but those are pretty important to having a useful copy
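
For anyone who does want to check what a polite crawler would skip, the standard library's robots.txt parser can answer that per URL. The robots.txt text used in the usage note below is an assumed reconstruction from yipdw's description (only the /js and /css rules are known from the log), and `allowed_paths` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def allowed_paths(robots_txt, urls, agent="*"):
    """Return {url: bool} saying whether each URL may be fetched by the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(agent, url) for url in urls}
```

With a robots.txt of `User-agent: *`, `Disallow: /js` and `Disallow: /css`, a stylesheet or script URL under those paths comes back as disallowed while a story page does not, which is exactly the gap yipdw is weighing.
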
19:33 ^🔗	nitro2k01	If you're "lawful good", just download those files manually. Can't be many of them?
19:33 ^🔗	nitro2k01	Not that it would matter right?
19:33 ^🔗	yipdw	that would work if I was downloading the whole site in one go
19:33 ^🔗	yipdw	which I do have a script for
19:34 ^🔗	yipdw	but there's no real way to do that on a per-user basis that doesn't involve a robot
19:34 ^🔗	nitro2k01	Mirroring a Yahoo Group worked a charm for an open group
19:34 ^🔗	yipdw	it's more like: on one hand, no, the CSS and JS are not important as far as the content is concerned
19:34 ^🔗	nitro2k01	Still need to log in for a members-only one
19:35 ^🔗	nitro2k01	(Using an ugly wget hack)
19:35 ^🔗	yipdw	on the other hand humans will be looking at at least part of this dataset and presentation then matters quite a bit
20:21 ^🔗	balrog_	hey all
20:21 ^🔗	balrog_	SketchCow: you there?
23:16 ^🔗	salvo	!list
23:38 ^🔗	chronomex	we should have an automatic kick on !list
23:38 ^🔗	chronomex	or maybe an autoresponder with our list of torrents
23:56 ^🔗	dnova	the.GIMP.ultimate.inc.keygen.DiViNiTy
23:59 ^🔗	PatC	Anyone know if 28c3 is up on archive.org yet?
