#archiveteam 2012-01-06,Fri

Time Nickname Message
00:52 πŸ”— yipdw oh nice
00:52 πŸ”— yipdw http://archive-access.sourceforge.net/projects/wayback/hadoop.html
00:52 πŸ”— yipdw it looks like that's the WARC ingester/browser I've been looking for
00:53 πŸ”— * yipdw gives it a shot
00:56 πŸ”— chronomex don't shoot!
00:58 πŸ”— underscor <Frooxius> It's one of my catchphrases
00:58 πŸ”— underscor <Frooxius> it's missing a tryphrase though
00:58 πŸ”— yipdw that was lamebda
00:58 πŸ”— * underscor cries
00:59 πŸ”— yipdw i can't come up with a continuation to that though
01:07 πŸ”— yipdw ugh
01:07 πŸ”— yipdw maybe I'm just kinda sleepy and/or stupid but I cannot figure out how to upload a WARC file to wayback
01:08 πŸ”— yipdw the administrator manual is quite dense too
01:09 πŸ”— yipdw oh, ugh, I have to write XML
01:10 πŸ”— underscor yipdw: I find it easier to just extract the contents with warctool
01:10 πŸ”— yipdw underscor: yeah, but I can't easily see what I got that way
01:10 πŸ”— yipdw flat files are a terrible way to check hypertext
01:10 πŸ”— underscor You can see which files, but yeah
01:10 πŸ”— yipdw https://github.com/openplanets/wap/tree/master/ArchiveExplorer looks like it's what I want
01:11 πŸ”— yipdw ...sort of
01:14 πŸ”— yipdw ha, damn, proust uses a lot of separate image files
01:15 πŸ”— yipdw I'm surprised that they didn't use one of the many many solutions out there for CSS spriting
01:15 πŸ”— yipdw oh wait no
01:16 πŸ”— yipdw this ArchiveExplorer is garbage
01:16 πŸ”— yipdw ok, this is what I want
01:16 πŸ”— yipdw I want a tool that takes as input the path to a *.warc.gz file
01:16 πŸ”— yipdw starts up a webserver
01:16 πŸ”— yipdw and then mounts the warc.gz at /
01:16 πŸ”— yipdw you can then access the started webserver via any web browser.
01:16 πŸ”— yipdw how hard can it be?
01:18 πŸ”— yipdw let's find out
01:19 πŸ”— yipdw time to yak shave
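
A rough sketch of the tool yipdw describes (take a *.warc.gz path, start a webserver, serve the archived responses at /), assuming Python's standard library only. It is a lazy take under stated assumptions, not anyone's actual implementation: it loads the whole archive into memory, keys responses by URL path only (ignoring host and query string), and does not undo chunked or compressed HTTP bodies. The function names and the port are made up for illustration.

```python
import gzip
import io
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

def iter_warc_records(data):
    """Yield (headers, block) for each record in uncompressed WARC bytes."""
    stream = io.BytesIO(data)
    while True:
        line = stream.readline()
        if not line:
            return  # end of data
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        for raw in iter(stream.readline, b"\r\n"):
            if not raw:
                return  # truncated record header
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers.get("content-length", "0")))
        yield headers, block

def build_index(data):
    """Map URL path -> (content_type, body) for HTTP response records."""
    index = {}
    for headers, block in iter_warc_records(data):
        if headers.get("warc-type") != "response":
            continue
        head, _, body = block.partition(b"\r\n\r\n")
        ctype = "application/octet-stream"
        for raw in head.split(b"\r\n")[1:]:  # skip the HTTP status line
            name, _, value = raw.decode("latin-1").partition(":")
            if name.strip().lower() == "content-type":
                ctype = value.strip()
        # Tolerate angle brackets around the target URI, if present.
        uri = headers.get("warc-target-uri", "").strip("<>")
        index[urlsplit(uri).path or "/"] = (ctype, body)
    return index

def make_handler(index):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            entry = index.get(urlsplit(self.path).path)
            if entry is None:
                self.send_error(404)
                return
            ctype, body = entry
            self.send_response(200)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
    return Handler

def serve(warc_gz_path, port=8000):
    """Load the archive, then serve it on localhost until interrupted."""
    with gzip.open(warc_gz_path, "rb") as f:
        index = build_index(f.read())
    HTTPServer(("127.0.0.1", port), make_handler(index)).serve_forever()
```

Calling serve() with the path to a local warc.gz and pointing a browser at http://127.0.0.1:8000/ would then show whatever the archive holds for each path. Pages that reference absolute URLs on other hosts will still reach out to the live web, which is the problem Wayback's URL rewriting addresses.
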
01:21 πŸ”— yipdw underscor: what's warctool?
01:21 πŸ”— yipdw I don't see it in the hanzo repo
01:22 πŸ”— yipdw it just occurred to me that if that can recreate some sort of directory structure, then I can get pretty far by just running thin on that
01:23 πŸ”— chronomex yipdw: wouldn't you have to decompress the .gz ? I'm pretty sure gzip is not seekable by default
01:23 πŸ”— yipdw chronomex: not sure of the implementation details yet; haven't actually tried building this
01:23 πŸ”— yipdw i want to see if I can take the lazy way out by decompressing and extracting contents
01:23 πŸ”— chronomex probably
01:24 πŸ”— yipdw but the important part for me is being able to verify the archive in a browser
01:34 πŸ”— yipdw actually, just running thin on the non-WARC bits seems to work reasonably ok
01:34 πŸ”— yipdw it doesn't load all images but I can verify that those exist
01:35 πŸ”— underscor There's something special about warc.gz files, where you can seek them and it just works™
01:35 πŸ”— underscor But I think that only applies to Heritrix warcs
01:35 πŸ”— underscor yipdw: http://warc-tools.googlecode.com/svn/trunk/app/python/
01:36 πŸ”— underscor I think warcdump.py does what you want
01:37 πŸ”— yipdw oh, I thought that repo was deprecated
01:37 πŸ”— yipdw well then
01:37 πŸ”— yipdw I'll give that a try
01:37 πŸ”— yipdw but first, time to cook stuff
01:37 πŸ”— underscor it worked for me
01:37 πŸ”— underscor but that was in july
01:38 πŸ”— underscor so the wget-warc spec might have changed
02:10 πŸ”— yipdw mmm, sockeye salmon and quinoa
04:47 πŸ”— SketchCow Hey, guys.
04:47 πŸ”— SketchCow Oh my god, what an amazing hotel and location.
04:53 πŸ”— Coderjoe underscor: that should work for wget-warc files as well. it has the code to put the offset or length or whatever in the gzip extrainfo field
04:59 πŸ”— Coderjoe it is done in a rather hackish way, too, IMO
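
A minimal sketch of why a warc.gz can be seekable even though a plain gzip stream is not, assuming each record is compressed as its own gzip member and the members are simply concatenated. The record contents are made up, and the offsets here are kept in a separate list at write time; this does not reproduce the gzip extra-field trick Coderjoe mentions.

```python
import gzip
import io
import zlib

def write_members(records):
    """Compress each record as its own gzip member; return (blob, offsets)."""
    buf = io.BytesIO()
    offsets = []
    for rec in records:
        offsets.append(buf.tell())  # byte offset where this member starts
        buf.write(gzip.compress(rec))
    return buf.getvalue(), offsets

def read_member(blob, offset):
    """Decompress exactly one gzip member starting at the given offset."""
    # MAX_WBITS | 16 tells zlib to expect gzip framing; decompression stops
    # at the end of the first member and ignores the members that follow.
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)
    return d.decompress(blob[offset:])

records = [b"record one\n", b"record two\n", b"record three\n"]
blob, offsets = write_members(records)
second = read_member(blob, offsets[1])
```

The concatenated blob is still a valid gzip file as a whole, so ordinary tools can decompress it front to back, while a reader holding an offset can jump straight to one record.
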
05:28 πŸ”— ce hi, everyone
05:28 πŸ”— chronomex ciao
05:29 πŸ”— ce happy new year
05:33 πŸ”— yipdw argh
05:33 πŸ”— yipdw if anyone can actually build this -> https://github.com/openplanets/wap/tree/master/ArchiveFS
05:33 πŸ”— yipdw let me know
05:34 πŸ”— yipdw I hate it when code that is supposedly up-to-date depends on dependencies that are five or six years old
05:34 πŸ”— yipdw it means that that code is actually not up-to-date
05:34 πŸ”— chronomex ce: bit late for that, eh? :)
05:46 πŸ”— yipdw oh god, guys, really
05:46 πŸ”— yipdw the authors of this Maven file made it download a hundred or so dependencies
05:46 πŸ”— yipdw everything BUT a Fuse-J build
05:47 πŸ”— yipdw so they included automatic resolution of all the trivial dependencies
05:55 πŸ”— yipdw man, the more I look for a way to inspect WARCs, the more I am convinced that nobody actually has a way to do it
05:55 πŸ”— yipdw this is a bit unsettling
05:57 πŸ”— yipdw or, more precisely, many people have written their own ways to do it, but those ways are crazy ad-hoc and require a very specific environment
07:01 πŸ”— yipdw ah!
07:01 πŸ”— yipdw I got wayback up and running
07:02 πŸ”— yipdw let's see what happens if I shove some WARCs at it
07:34 πŸ”— yipdw wholy fuck
07:34 πŸ”— yipdw it works
07:34 πŸ”— chronomex o_o
07:34 πŸ”— chronomex rad
07:35 πŸ”— yipdw http://depot.ninjawedding.org/wayback.png
07:36 πŸ”— yipdw this is rad
07:36 πŸ”— yipdw "Think Proust is neat? "
07:36 πŸ”— yipdw no
07:36 πŸ”— yipdw think wayback is neat, yes
07:36 πŸ”— yipdw at least once you get it working
07:41 πŸ”— yipdw hmm, that's weird. I have resources in the WARC under proust.com/i/a22/... that are referenced from proust.com/story
07:41 πŸ”— yipdw but the Wayback Machine can't find them
07:41 πŸ”— yipdw dunno what that means
07:42 πŸ”— yipdw oh, they're not actually in the WARC
07:42 πŸ”— yipdw ok
08:25 πŸ”— yipdw hmm, here's a stumper re: Javascript and archival of pages
08:25 πŸ”— yipdw the Wayback Machine's viewer frame includes jQuery 1.3.2
08:25 πŸ”— yipdw many sites make use of things like $(...).delegate()
08:25 πŸ”— yipdw which is not supported in jQuery 1.3.2
08:25 πŸ”— yipdw and actually, Wayback's viewer is not an iframe
08:25 πŸ”— chronomex HMMMM.
08:25 πŸ”— yipdw so there's a jQuery conflict
08:26 πŸ”— yipdw I just noticed this while viewing Proust grabs
08:26 πŸ”— yipdw not sure how to fix it, aside from filing it as a Wayback Machine bug
08:28 πŸ”— yipdw also, does it make sense to archive things like http://www.google-analytics.com/ga.js if they're included in a page?
08:29 πŸ”— yipdw I mean, on the one hand it IS part of the page
08:29 πŸ”— yipdw but on the other:
08:29 πŸ”— yipdw (1) Google Analytics is fucking evil
08:29 πŸ”— yipdw (2) there is no way to block GA served via archive.org
08:29 πŸ”— yipdw without collateral damage to other scripts served from archive.org (or any Wayback server)
08:30 πŸ”— chronomex why is GA evil?
08:31 πŸ”— yipdw my perspective
08:31 πŸ”— chronomex sure, what makes you think so?
08:31 πŸ”— yipdw it reminds me too much of Deus Ex's Daedalus
08:32 πŸ”— yipdw and other such panopticons
08:32 πŸ”— chronomex I'm not familiar with this Daedalus, care to explain?
08:32 πŸ”— yipdw Daedalus, in Deus Ex, was a distributed system that reported on all communications to the Illuminati
08:32 πŸ”— chronomex ah
08:32 πŸ”— yipdw I never said it was a rational dislike :P
08:33 πŸ”— yipdw but I think point (2) is the more important one anyway
08:33 πŸ”— chronomex did I say anything about you being irrational?
08:33 πŸ”— chronomex it seems like you are insecure in your beliefs
08:33 πŸ”— yipdw no, just throwing that out there
08:33 πŸ”— chronomex if you believe it, own it. don't half-believe.
08:33 πŸ”— yipdw namely, that if someone has GA block rules set up via noscript, then this sort of thing is a runaround
08:34 πŸ”— chronomex that's true. perhaps you should email info@archive.org.
08:34 πŸ”— chronomex but consider this
08:34 πŸ”— chronomex wayback machine scraper saves everything including the HTTP headers. why would they alter the page itself?
08:34 πŸ”— yipdw well, the scraper doesn't alter the page
08:34 πŸ”— chronomex correct
08:35 πŸ”— chronomex I bet you could gin up some sort of noscript rules
08:35 πŸ”— yipdw I'm going to suck in all external javascripts, too
08:35 πŸ”— chronomex (I know nothing about noscript)
08:35 πŸ”— yipdw but
08:35 πŸ”— yipdw when the page is presented via the Wayback Machine, CSS and Javascript references are rewritten
08:35 πŸ”— chronomex mandang, fucking amazon charging me sales tax
08:36 πŸ”— chronomex it's not like I live in SEATTLE or anything
08:36 πŸ”— Coderjoe I also do not like GA, simply from the panopticon-like side.
08:36 πŸ”— chronomex oh wait, I do. I can see their old HQ from my house.
08:36 πŸ”— Coderjoe you think they just track you on one site? HA. they track you everywhere you go that uses GA
08:36 πŸ”— chronomex I like GA, and I wish it did more.
08:37 πŸ”— chronomex though I don't really use it much, I tend to lean harder on getclicky.com
08:37 πŸ”— yipdw oh, different audiences there
08:37 πŸ”— chronomex :P
08:37 πŸ”— yipdw if I had to use GA for, well, analytics, yeah, I'd love it
08:37 πŸ”— yipdw but that's because in that case I Have The Power
08:37 πŸ”— yipdw :P
08:38 πŸ”— chronomex web's quite different when you're on the other side of things
08:38 πŸ”— Coderjoe why not just use a log analysis tool to get the stats?
08:38 πŸ”— yipdw yes
08:38 πŸ”— chronomex changes your perspective rather forcibly
08:38 πŸ”— yipdw I've abused that power quite often
08:38 πŸ”— yipdw like changing images on my site that get lots of external references
08:38 πŸ”— chronomex Coderjoe: will that get you screen resolution or amount of time the user kept the page open before leaving?
08:39 πŸ”— Coderjoe the guy running rosettacode.org has noticed that GA only shows around 50% or so of his traffic (comparing to log files)
08:39 πŸ”— chronomex GA samples stochastically when you get above some threshold
08:39 πŸ”— Coderjoe yipdw: ever given the hello.jpg to those bandwidth thieves?
08:40 πŸ”— yipdw Coderjoe: no, but I advise you not to browse animemusicvideos.org/forum at work anymore
08:40 πŸ”— chronomex hahahaha
08:40 πŸ”— yipdw there are several replacements of facepalm-256.gif with penises
08:40 πŸ”— Coderjoe O_o
08:40 πŸ”— chronomex excellent
08:40 πŸ”— Coderjoe yay
08:41 πŸ”— chronomex world needs more penises
08:41 πŸ”— chronomex 3.5 billion isn't enough
08:44 πŸ”— chronomex not by a long shot
08:44 πŸ”— yipdw there must be more than 3.5 billion penises in the world
08:44 πŸ”— chronomex it's about the right order of magnitude
08:45 πŸ”— yipdw what about all those insects
08:46 πŸ”— chronomex oh, I was just counting humans
08:46 πŸ”— yipdw but, uh
08:46 πŸ”— yipdw dicks aside
08:47 πŸ”— yipdw wayback rewrites script src URLs from e.g. http://platform.twitter.com/widgets.js to something like
08:47 πŸ”— yipdw makes it tricky
08:48 πŸ”— yipdw I'll grab them all; proust seems to break in weird ways if some externals aren't present
11:36 πŸ”— ersi Mmmh, DOS games
18:43 πŸ”— yipdw huh, what the fuck
18:43 πŸ”— yipdw FileUtils#mkdir_p is returning Errno::EEXIST
18:43 πŸ”— yipdw that...should not happen
18:44 πŸ”— yipdw oH!
18:44 πŸ”— yipdw because it's a dead symlink
18:44 πŸ”— yipdw oops
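
The same surprise can be reproduced outside Ruby. A small sketch using Python's os.makedirs as a stand-in for FileUtils#mkdir_p (an assumption made for illustration; yipdw's script is Ruby): a dangling symlink occupies the name, so the directory cannot be created, yet the path does not count as an existing directory either. The names "out" and "missing-target" are hypothetical.

```python
import os
import tempfile

def makedirs_reporting(path):
    """Try to create path; return a short description of what happened."""
    try:
        os.makedirs(path, exist_ok=True)
        return "ok"
    except FileExistsError:
        # islink() looks at the link itself; exists() follows it to the target.
        if os.path.islink(path) and not os.path.exists(path):
            return "dangling symlink"
        return "exists, not a directory"

with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "out")
    os.symlink(os.path.join(tmp, "missing-target"), link)  # target never created
    result = makedirs_reporting(link)
```
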
19:02 πŸ”— yipdw also, if anyone here is running a Redis instance, I'd appreciate help running the discovery scripts in https://github.com/ArchiveTeam/proust-pulling
19:02 πŸ”— yipdw it seems the best shot for converging on a full set of users is to just run those over and over
19:13 πŸ”— yipdw wait, never mind, I'm a moron
19:13 πŸ”— yipdw I jsut ralized proust publishes a sitemap that includes all public stories
19:13 πŸ”— yipdw and that I cant' type
19:20 πŸ”— Coderjoe I mentioned the sitemap, but I think it hasn't been updated in awhile
19:20 πŸ”— Coderjoe there were some new stories on the people page that I don't think were in the sitemap
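
A sketch of combining a sitemap with other discovery sources, as discussed above, assuming the sitemap follows the standard sitemaps.org XML format. The sample document and the story URLs in it are invented for illustration; Proust's actual sitemap is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return the set of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}

def merge_sources(*sources):
    """Union any number of URL collections into one set."""
    merged = set()
    for source in sources:
        merged |= set(source)
    return merged

# Hypothetical sample data, not taken from the real site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://proust.com/story/example-a</loc></url>
  <url><loc>http://proust.com/story/example-b</loc></url>
</urlset>"""

from_sitemap = sitemap_urls(SAMPLE)
from_crawl = {"http://proust.com/story/example-b",
              "http://proust.com/story/example-c"}
combined = merge_sources(from_sitemap, from_crawl)
```

Because the result is a set, stories that appear in both the sitemap and the crawl are counted once, and anything the stale sitemap misses is still picked up from the other source.
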
19:31 πŸ”— QUICKSTRE FILM STREAMING! http://quickstream.altervista.org/
19:31 πŸ”— nitro2k01 F! F! S!
19:31 πŸ”— yipdw Coderjoe: yeah, but I can combine it with the existing sources
19:32 πŸ”— Coderjoe yipdw: yeah, that's what I was expecting would happen. at least pulling people from it.
19:32 πŸ”— yipdw proust's robots.txt also disallows /js and /css
19:32 πŸ”— yipdw I'm debating whether or not to abide by that
19:32 πŸ”— yipdw standard protocol here is, I think, "fuck it"
19:33 πŸ”— yipdw but those are pretty important to having a useful copy
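
For a sense of what abiding by those rules would block, here is a small sketch using Python's urllib.robotparser. The robots.txt text below is an assumption reconstructed from yipdw's description (/js and /css disallowed), not the real file, and /js/site.js is a hypothetical path.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules, reconstructed from the description above; not the real file.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /js
Disallow: /css
"""

parser = RobotFileParser()
parser.parse(ASSUMED_ROBOTS_TXT.splitlines())

def allowed(url, agent="*"):
    """Return True if the assumed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

script_ok = allowed("http://proust.com/js/site.js")
story_ok = allowed("http://proust.com/story")
```

Under these assumed rules a compliant crawler would fetch the story pages but skip every stylesheet and script, which is exactly the "useful copy" concern raised above.
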
19:33 πŸ”— nitro2k01 If you're "lawful good", just download those files manually. Can't be many of them?
19:33 πŸ”— nitro2k01 Not that it would matter right?
19:33 πŸ”— yipdw that would work if I was downloading the whole site in one go
19:33 πŸ”— yipdw which I do have a script for
19:34 πŸ”— yipdw but there's no real way to do that on a per-user basis that doesn't involve a robot
19:34 πŸ”— nitro2k01 Mirroring a Yahoo Group worked a charm for an open group
19:34 πŸ”— yipdw it's more like: on one hand, no, the CSS and JS are not important wrt the content is concerned
19:34 πŸ”— nitro2k01 Still need to log in for a members-only one
19:35 πŸ”— nitro2k01 (Using an ugly wget hack)
19:35 πŸ”— yipdw on the other hand humans will be looking at at least part of this dataset and presentation then matters quite a bit
20:21 πŸ”— balrog_ hey all
20:21 πŸ”— balrog_ SketchCow: you there?
23:16 πŸ”— salvo !list
23:38 πŸ”— chronomex we should have an automatic kick on !list
23:38 πŸ”— chronomex or maybe an autoresponder with our list of torrents
23:56 πŸ”— dnova the.GIMP.ultimate.inc.keygen.DiViNiTy
23:59 πŸ”— PatC Anyone know if 28c3 is up on archive.org yet?
