[00:52] oh nice
[00:52] http://archive-access.sourceforge.net/projects/wayback/hadoop.html
[00:52] it looks like that's the WARC ingester/browser I've been looking for
[00:53] * yipdw gives it a shot
[00:56] don't shoot!
[00:58] It's one of my catchphrases
[00:58] it's missing a tryphrase though
[00:58] that was lamebda
[00:58] * underscor cries
[00:59] i can't come up with a continuation to that though
[01:07] ugh
[01:07] maybe I'm just kinda sleepy and/or stupid but I cannot figure out how to upload a WARC file to wayback
[01:08] the administrator manual is quite dense too
[01:09] oh, ugh, I have to write XML
[01:10] yipdw: I find it easier to just extract the contents with warctool
[01:10] underscor: yeah, but I can't easily see what I got that way
[01:10] flat files are a terrible way to check hypertext
[01:10] You can see which files, but yeah
[01:10] https://github.com/openplanets/wap/tree/master/ArchiveExplorer looks like it's what I want
[01:11] ...sort of
[01:14] ha, damn, proust uses a lot of separate image files
[01:15] I'm surprised that they didn't use one of the many many solutions out there for CSS spriting
[01:15] oh wait no
[01:16] this ArchiveExplorer is garbage
[01:16] ok, this is what I want
[01:16] I want a tool that takes as input the path to a *.warc.gz file
[01:16] starts up a webserver
[01:16] and then mounts the warc.gz at /
[01:16] you can then access the started webserver via any web browser.
[01:16] how hard can it be?
[01:18] let's find out
[01:19] time to yak shave
[01:21] underscor: what's warctool?
[01:21] I don't see it in the hanzo repo
[01:22] it just occurred to me that if that can recreate some sort of directory structure, then I can get pretty far by just running thin on that
[01:23] yipdw: wouldn't you have to decompress the .gz? I'm pretty sure gzip is not seekable by default
[01:23] chronomex: not sure of the implementation details yet; haven't actually tried building this
[01:23] i want to see if I can take the lazy way out by decompressing and extracting contents
[01:23] probably
[01:24] but the important part for me is being able to verify the archive in a browser
[01:34] actually, just running thin on the non-WARC bits seems to work reasonably ok
[01:34] it doesn't load all images but I can verify that those exist
[01:35] There's something special about warc.gz files, where you can seek them and it just works™
[01:35] But I think that only applies to Heritrix WARCs
[01:35] yipdw: http://warc-tools.googlecode.com/svn/trunk/app/python/
[01:36] I think warcdump.py does what you want
[01:37] oh, I thought that repo was deprecated
[01:37] well then
[01:37] I'll give that a try
[01:37] but first, time to cook stuff
[01:37] it worked for me
[01:37] but that was in July
[01:38] so the wget-warc spec might have changed
[02:10] mmm, sockeye salmon and quinoa
[04:47] Hey, guys.
[04:47] Oh my god, what an amazing hotel and location.
[04:53] underscor: that should work for wget-warc files as well. it has the code to put the offset or length or whatever in the gzip extrainfo field
[04:59] it is done in a rather hackish way, too, IMO
[05:28] hi everyone
[05:28] hi
[05:29] happy new year
[05:33] argh
[05:33] if anyone can actually build this -> https://github.com/openplanets/wap/tree/master/ArchiveFS
[05:33] let me know
[05:34] I hate it when code that is supposedly up-to-date depends on dependencies that are five or six years old
[05:34] it means that that code is actually not up-to-date
[05:34] ce: bit late for that, eh?
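A minimal sketch of the tool yipdw describes at [01:16]: index the HTTP response records in a *.warc.gz by URL path, then serve the payloads over HTTP so any browser can inspect the grab. This assumes the warcio library (not mentioned in the log), which transparently handles the per-record gzip members that make warc.gz seekable, as discussed at [01:35]; the port and path-only matching are placeholders.

```python
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Index every HTTP response record in the WARC by its URL path.
records = {}
with open(sys.argv[1], 'rb') as warc:
    for record in ArchiveIterator(warc):
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        ctype = record.http_headers.get_header('Content-Type') or 'application/octet-stream'
        records[urlsplit(url).path or '/'] = (ctype, record.content_stream().read())

class WarcHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Match on path only; this ignores query strings and hostnames,
        # which is crude but enough for eyeballing a single-site grab.
        hit = records.get(urlsplit(self.path).path)
        if hit is None:
            self.send_error(404)
            return
        ctype, body = hit
        self.send_response(200)
        self.send_header('Content-Type', ctype)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(('127.0.0.1', 8000), WarcHandler).serve_forever()
```

Run as `python serve_warc.py grab.warc.gz` and browse http://127.0.0.1:8000/ — not a replacement for Wayback's link rewriting, but it answers "are the resources actually in the WARC?"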
:)
[05:46] oh god, guys, really
[05:46] the authors of this Maven file made it download a hundred or so dependencies
[05:46] everything BUT a Fuse-J build
[05:47] so they included automatic resolution of all the trivial dependencies
[05:55] man, the more I look for a way to inspect WARCs, the more I am convinced that nobody actually has a way to do it
[05:55] this is a bit unsettling
[05:57] or, more precisely, many people have written their own ways to do it, but those ways are crazy ad-hoc and require a very specific environment
[07:01] ah!
[07:01] I got wayback up and running
[07:02] let's see what happens if I shove some WARCs at it
[07:34] holy fuck
[07:34] it works
[07:34] o_o
[07:34] rad
[07:35] http://depot.ninjawedding.org/wayback.png
[07:36] this is rad
[07:36] "Think Proust is neat?"
[07:36] no
[07:36] think wayback is neat, yes
[07:36] at least once you get it working
[07:41] hmm, that's weird. I have resources in the WARC under proust.com/i/a22/... that are referenced from proust.com/story
[07:41] but the Wayback Machine can't find them
[07:41] dunno what that means
[07:42] oh, they're not actually in the WARC
[07:42] ok
[08:25] hmm, here's a stumper re: Javascript and archival of pages
[08:25] the Wayback Machine's viewer frame includes jQuery 1.3.2
[08:25] many sites make use of things like $(...).delegate()
[08:25] which is not supported in jQuery 1.3.2
[08:25] and actually, Wayback's viewer is not an iframe
[08:25] HMMMM.
[08:25] so there's a jQuery conflict
[08:26] I just noticed this while viewing Proust grabs
[08:26] not sure how to fix it, aside from filing it as a Wayback Machine bug
[08:28] also, does it make sense to archive things like http://www.google-analytics.com/ga.js if they're included in a page?
[08:29] I mean, on the one hand it IS part of the page
[08:29] but on the other:
[08:29] (1) Google Analytics is fucking evil
[08:29] (2) there is no way to block GA served via archive.org
[08:29] without collateral damage to other scripts served from archive.org (or any Wayback server)
[08:30] why is GA evil?
[08:31] my perspective
[08:31] sure, what makes you think so?
[08:31] it reminds me too much of Deus Ex's Daedalus
[08:32] and other such panopticons
[08:32] I'm not familiar with this Daedalus, care to explain?
[08:32] Daedalus, in Deus Ex, was a distributed system that reported on all communications to the Illuminati
[08:32] ah
[08:32] I never said it was a rational dislike :P
[08:33] but I think point (2) is the more important one anyway
[08:33] did I say anything about you being irrational?
[08:33] it seems like you are insecure in your beliefs
[08:33] no, just throwing that out there
[08:33] if you believe it, own it. don't half-believe.
[08:33] namely, that if someone has GA block rules set up via noscript, then this sort of thing is a runaround
[08:34] that's true. perhaps you should email info@archive.org.
[08:34] but consider this
[08:34] wayback machine scraper saves everything including the HTTP headers. why would they alter the page itself?
[08:34] well, the scraper doesn't alter the page
[08:34] correct
[08:35] I bet you could gin up some sort of noscript rules
[08:35] I'm going to suck in all external javascripts, too
[08:35] (I know nothing about noscript)
[08:35] but
[08:35] when the page is presented via the Wayback Machine, CSS and Javascript references are rewritten
[08:35] man, dang, fucking amazon charging me sales tax
[08:36] it's not like I live in SEATTLE or anything
[08:36] I also do not like GA, simply from the panopticon-like side.
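As an aside on [08:35]'s "suck in all external javascripts": a rough sketch of how one might list the external script URLs referenced by HTML pages in a grab, so they can be fed back into the crawler. Again this assumes the warcio library; the regex is deliberately crude (no HTML parser) and the input path is a placeholder.

```python
import re
import sys
from urllib.parse import urljoin, urlsplit

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Crude pattern for <script ... src="..."> in raw HTML bytes.
script_src = re.compile(rb'<script[^>]+src=["\']([^"\']+)', re.I)

externals = set()
with open(sys.argv[1], 'rb') as warc:
    for record in ArchiveIterator(warc):
        if record.rec_type != 'response':
            continue
        ctype = record.http_headers.get_header('Content-Type') or ''
        if 'html' not in ctype:
            continue
        page = record.rec_headers.get_header('WARC-Target-URI')
        for src in script_src.findall(record.content_stream().read()):
            url = urljoin(page, src.decode('ascii', 'replace'))
            # Keep only scripts hosted off-site (GA, Twitter widgets, etc.)
            if urlsplit(url).netloc != urlsplit(page).netloc:
                externals.add(url)

print('\n'.join(sorted(externals)))
```

The output is a plain URL list, one per line, suitable for a follow-up fetch pass.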
[08:36] oh wait, I do. I can see their old HQ from my house.
[08:36] you think they just track you on one site? HA. they track you everywhere you go that uses GA
[08:36] I like GA, and I wish it did more.
[08:37] though I don't really use it much, I tend to lean harder on getclicky.com
[08:37] oh, different audiences there
[08:37] :P
[08:37] if I had to use GA for, well, analytics, yeah, I'd love it
[08:37] but that's because in that case I Have The Power
[08:37] :P
[08:38] web's quite different when you're on the other side of things
[08:38] why not just use a log analysis tool to get the stats?
[08:38] yes
[08:38] changes your perspective rather forcibly
[08:38] I've abused that power quite often
[08:38] like changing images on my site that get lots of external references
[08:38] Coderjoe: will that get you screen resolution or amount of time the user kept the page open before leaving?
[08:39] the guy running rosettacode.org has noticed that GA only shows around 50% or so of his traffic (comparing to log files)
[08:39] GA samples stochastically when you get above some threshold
[08:39] yipdw: ever given the hello.jpg to those bandwidth thieves?
[08:40] Coderjoe: no, but I advise you not to browse animemusicvideos.org/forum at work anymore
[08:40] hahahaha
[08:40] there are several replacements of facepalm-256.gif with penises
[08:40] O_o
[08:40] excellent
[08:40] yay
[08:41] world needs more penises
[08:41] 3.5 billion isn't enough
[08:44] not by a long shot
[08:44] there must be more than 3.5 billion penises in the world
[08:44] it's about the right order of magnitude
[08:45] what about all those insects
[08:46] oh, I was just counting humans
[08:46] but, uh
[08:46] dicks aside
[08:47] wayback rewrites script src URLs from e.g. http://platform.twitter.com/widgets.js to something like http://192.168.122.246:8080/20120106083222js_/http://platform.twitter.com/widgets.js
[08:47] makes it tricky
[08:48] I'll grab them all; proust seems to break in weird ways if some externals aren't present
[11:36] Mmmh, DOS games
[18:43] huh, what the fuck
[18:43] FileUtils#mkdir_p is returning Errno::EEXIST
[18:43] that...should not happen
[18:44] oh!
[18:44] because it's a dead symlink
[18:44] oops
[19:02] also, if anyone here is running a Redis instance, I'd appreciate help running the discovery scripts in https://github.com/ArchiveTeam/proust-pulling
[19:02] it seems the best shot for converging on a full set of users is to just run those over and over
[19:13] wait, never mind, I'm a moron
[19:13] I jsut ralized proust publishes a sitemap that includes all public stories
[19:13] and that I cant' type
[19:20] I mentioned the sitemap, but I think it hasn't been updated in awhile
[19:20] there were some new stories on the people page that I don't think were in the sitemap
[19:31] FILM STREAMING! http://quickstream.altervista.org/
[19:31] F! F! S!
[19:31] Coderjoe: yeah, but I can combine it with the existing sources
[19:32] yipdw: yeah, that's what I was expecting would happen. at least pulling people from it.
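For the sitemap approach at [19:13], pulling the story URLs out takes only a few lines. A sketch assuming the sitemap lives at the conventional /sitemap.xml and follows the sitemaps.org protocol (some sites publish a sitemap index pointing at sub-sitemaps instead, which this does not handle):

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

with urllib.request.urlopen('http://proust.com/sitemap.xml') as resp:
    tree = ET.parse(resp)

# Each <url><loc>...</loc></url> entry is one public story URL.
story_urls = [loc.text for loc in tree.findall('.//sm:loc', NS)]
print(len(story_urls), 'URLs found')
```

Given the staleness caveat at [19:20], the resulting list is best merged with the existing discovery sources rather than used alone, as noted at [19:31].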
[19:32] proust's robots.txt also disallows /js and /css
[19:32] I'm debating whether or not to abide by that
[19:32] standard protocol here is, I think, "fuck it"
[19:33] but those are pretty important to having a useful copy
[19:33] If you're "lawful good", just download those files manually. Can't be many of them?
[19:33] Not that it would matter, right?
[19:33] that would work if I was downloading the whole site in one go
[19:33] which I do have a script for
[19:34] but there's no real way to do that on a per-user basis that doesn't involve a robot
[19:34] Mirroring a Yahoo Group worked a charm for an open group
[19:34] it's more like: on one hand, no, the CSS and JS are not important as far as the content is concerned
[19:34] Still need to log in for a members-only one
[19:35] (Using an ugly wget hack)
[19:35] on the other hand humans will be looking at at least part of this dataset, and presentation then matters quite a bit
[20:21] hey all
[20:21] SketchCow: you there?
[23:16] !list
[23:38] we should have an automatic kick on !list
[23:38] or maybe an autoresponder with our list of torrents
[23:56] the.GIMP.ultimate.inc.keygen.DiViNiTy
[23:59] Anyone know if 28c3 is up on archive.org yet?
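If one did take the "lawful good" route from [19:33] and fetch the disallowed assets by hand, a plain HTTP client never consults robots.txt anyway. A sketch; the asset paths below are hypothetical examples, not an inventory of proust.com:

```python
import os
import urllib.request

# Hypothetical asset paths under the robots.txt-disallowed prefixes.
assets = ['/css/main.css', '/js/app.js']

for path in assets:
    url = 'http://proust.com' + path
    dest = path.lstrip('/')
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # urllib performs the request directly; robots.txt is only ever
    # honored if you opt in via urllib.robotparser.
    with urllib.request.urlopen(url) as resp, open(dest, 'wb') as out:
        out.write(resp.read())
    print('saved', dest)
```

For a full-site wget run, the usual way to take the "fuck it" route instead is `-e robots=off`, which tells wget to ignore the disallow rules entirely.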