Time |
Nickname |
Message |
00:52
π
|
yipdw |
oh nice |
00:52
π
|
yipdw |
http://archive-access.sourceforge.net/projects/wayback/hadoop.html |
00:52
π
|
yipdw |
it looks like that's the WARC ingester/browser I've been looking for |
00:53
π
|
* |
yipdw gives it a shot |
00:56
π
|
chronomex |
don't shoot! |
00:58
π
|
underscor |
<Frooxius> It's one of my catchphrases |
00:58
π
|
underscor |
<Frooxius> it's missing a tryphrase though |
00:58
π
|
yipdw |
that was lamebda |
00:58
π
|
* |
underscor cries |
00:59
π
|
yipdw |
i can't come up with a continuation to that though |
01:07
π
|
yipdw |
ugh |
01:07
π
|
yipdw |
maybe I'm just kinda sleepy and/or stupid but I cannot figure out how to upload a WARC file to wayback |
01:08
π
|
yipdw |
the administrator manual is quite dense too |
01:09
π
|
yipdw |
oh, ugh, I have to write XML |
01:10
π
|
underscor |
yipdw: I find it easier to just extract the contents with warctool |
01:10
π
|
yipdw |
underscor: yeah, but I can't easily see what I got that way |
01:10
π
|
yipdw |
flat files are a terrible way to check hypertext |
01:10
π
|
underscor |
You can see which files, but yeah |
01:10
π
|
yipdw |
https://github.com/openplanets/wap/tree/master/ArchiveExplorer looks like it's what I want |
01:11
π
|
yipdw |
...sort of |
01:14
π
|
yipdw |
ha, damn, proust uses a lot of separate image files |
01:15
π
|
yipdw |
I'm surprised that they didn't use one of the many many solutions out there for CSS spriting |
01:15
π
|
yipdw |
oh wait no |
01:16
π
|
yipdw |
this ArchiveExplorer is garbage |
01:16
π
|
yipdw |
ok, this is what I want |
01:16
π
|
yipdw |
I want a tool that takes as input the path to a *.warc.gz file |
01:16
π
|
yipdw |
starts up a webserver |
01:16
π
|
yipdw |
and then mounts the warc.gz at / |
01:16
π
|
yipdw |
you can then access the started webserver via any web browser. |
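A minimal sketch of the tool described above, assuming only the Python standard library: it reads a *.warc.gz, keeps the bodies of `response` records in memory, and serves them from `/`. The names `iter_warc_records`, `load_responses`, and `serve` are hypothetical. It collapses all hosts into one namespace and ignores chunked transfer encoding and Content-Encoding, so it is only good for eyeballing small grabs.

```python
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit


def iter_warc_records(stream):
    """Yield (headers, block) for each record in a decompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record header: %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def load_responses(path):
    """Map URL path (plus query) to (content_type, body) for response records."""
    pages = {}
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as stream:
        for headers, block in iter_warc_records(stream):
            if headers.get("warc-type") != "response":
                continue
            head, _, body = block.partition(b"\r\n\r\n")
            ctype = "application/octet-stream"
            for line in head.split(b"\r\n")[1:]:  # skip the HTTP status line
                name, _, value = line.decode("latin-1").partition(":")
                if name.strip().lower() == "content-type":
                    ctype = value.strip()
            url = urlsplit(headers.get("warc-target-uri", "").strip("<>"))
            key = (url.path or "/") + ("?" + url.query if url.query else "")
            pages[key] = (ctype, body)
    return pages


def serve(path, port=8000):
    """Serve the archived responses on http://127.0.0.1:<port>/ until interrupted."""
    pages = load_responses(path)

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path not in pages:
                self.send_error(404, "not in WARC")
                return
            ctype, body = pages[self.path]
            self.send_response(200)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve("grab.warc.gz")` (file name made up) and pointing a browser at http://127.0.0.1:8000/ would show whatever was captured for `/`.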
01:16
π
|
yipdw |
how hard can it be? |
01:18
π
|
yipdw |
let's find out |
01:19
π
|
yipdw |
time to yak shave |
01:21
π
|
yipdw |
underscor: what's warctool? |
01:21
π
|
yipdw |
I don't see it in the hanzo repo |
01:22
π
|
yipdw |
it just occurred to me that if that can recreate some sort of directory structure, then I can get pretty far by just running thin on that |
01:23
π
|
chronomex |
yipdw: wouldn't you have to decompress the .gz ? I'm pretty sure gzip is not seekable by default |
01:23
π
|
yipdw |
chronomex: not sure of the implementation details yet; haven't actually tried building this |
01:23
π
|
yipdw |
i want to see if I can take the lazy way out by decompressing and extracting contents |
01:23
π
|
chronomex |
probably |
01:24
π
|
yipdw |
but the important part for me is being able to verify the archive in a browser |
01:34
π
|
yipdw |
actually, just running thin on the non-WARC bits seems to work reasonably ok |
01:34
π
|
yipdw |
it doesn't load all images but I can verify that those exist |
01:35
π
|
underscor |
There's something special about warc.gz files, where you can seek them and it just works™ |
01:35
π
|
underscor |
But I think that only applies to Heritrix warcs |
01:35
π
|
underscor |
yipdw: http://warc-tools.googlecode.com/svn/trunk/app/python/ |
01:36
π
|
underscor |
I think warcdump.py does what you want |
01:37
π
|
yipdw |
oh, I thought that repo was deprecated |
01:37
π
|
yipdw |
well then |
01:37
π
|
yipdw |
I'll give that a try |
01:37
π
|
yipdw |
but first, time to cook stuff |
01:37
π
|
underscor |
it worked for me |
01:37
π
|
underscor |
but that was in july |
01:38
π
|
underscor |
so the wget-warc spec might have changed |
02:10
π
|
yipdw |
mmm, sockeye salmon and quinoa |
04:47
π
|
SketchCow |
Hey, guys. |
04:47
π
|
SketchCow |
Oh my god, what an amazing hotel and location. |
04:53
π
|
Coderjoe |
underscor: that should work for wget-warc files as well. it has the code to put the offset or length or whatever in the gzip extrainfo field |
04:59
π
|
Coderjoe |
it is done in a rather hackish way, too, IMO |
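A sketch of why a warc.gz can be seekable, assuming the writer compressed each record as its own gzip member (not confirmed here for any particular tool). Given a byte offset to the start of a member, that one record can be decompressed without touching the rest. `read_member` and `member_offsets` are hypothetical helper names, and this does not read the gzip extra field mentioned above; it just scans once to build an index.

```python
import gzip
import zlib


def read_member(data, offset):
    """Decompress the single gzip member starting at `offset`.

    Returns (payload, next_offset), where next_offset is where the
    following member (if any) begins.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    payload = decomp.decompress(data[offset:])
    # Bytes after the end of this member are left in unused_data.
    next_offset = len(data) - len(decomp.unused_data)
    return payload, next_offset


def member_offsets(data):
    """Scan once and return the start offset of every gzip member."""
    offsets, pos = [], 0
    while pos < len(data):
        offsets.append(pos)
        _, pos = read_member(data, pos)
    return offsets


# Three stand-in "records", each compressed separately and concatenated.
records = [b"record one", b"record two", b"record three"]
blob = b"".join(gzip.compress(r) for r in records)

index = member_offsets(blob)
payload, _ = read_member(blob, index[2])  # jump straight to the third record
print(payload)
```

A plain single-member .gz gives an index with one entry, which is the case chronomex was worried about: nothing to seek to except the start.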
05:28
π
|
ce |
ciao a tutti |
05:28
π
|
chronomex |
ciao |
05:29
π
|
ce |
buon anno |
05:33
π
|
yipdw |
argh |
05:33
π
|
yipdw |
if anyone can actually build this -> https://github.com/openplanets/wap/tree/master/ArchiveFS |
05:33
π
|
yipdw |
let me know |
05:34
π
|
yipdw |
I hate it when code that is supposedly up-to-date depends on dependencies that are five or six years old |
05:34
π
|
yipdw |
it means that that code is actually not up-to-date |
05:34
π
|
chronomex |
ce: bit late for that, eh? :) |
05:46
π
|
yipdw |
oh god, guys, really |
05:46
π
|
yipdw |
the authors of this Maven file made it download a hundred or so dependencies |
05:46
π
|
yipdw |
everything BUT a Fuse-J build |
05:47
π
|
yipdw |
so they included automatic resolution of all the trivial dependencies |
05:55
π
|
yipdw |
man, the more I look for a way to inspect WARCs, the more I am convinced that nobody actually has a way to do it |
05:55
π
|
yipdw |
this is a bit unsettling |
05:57
π
|
yipdw |
or, more precisely, many people have written their own ways to do it, but those ways are crazy ad-hoc and require a very specific environment |
07:01
π
|
yipdw |
ah! |
07:01
π
|
yipdw |
I got wayback up and running |
07:02
π
|
yipdw |
let's see what happens if I shove some WARCs at it |
07:34
π
|
yipdw |
wholy fuck |
07:34
π
|
yipdw |
it works |
07:34
π
|
chronomex |
o_o |
07:34
π
|
chronomex |
rad |
07:35
π
|
yipdw |
http://depot.ninjawedding.org/wayback.png |
07:36
π
|
yipdw |
this is rad |
07:36
π
|
yipdw |
"Think Proust is neat? " |
07:36
π
|
yipdw |
no |
07:36
π
|
yipdw |
think wayback is neat, yes |
07:36
π
|
yipdw |
at least once you get it working |
07:41
π
|
yipdw |
hmm, that's weird. I have resources in the WARC under proust.com/i/a22/... that are referenced from proust.com/story |
07:41
π
|
yipdw |
but the Wayback Machine can't find them |
07:41
π
|
yipdw |
dunno what that means |
07:42
π
|
yipdw |
oh, they're not actually in the WARC |
07:42
π
|
yipdw |
ok |
08:25
π
|
yipdw |
hmm, here's a stumper re: Javascript and archival of pages |
08:25
π
|
yipdw |
the Wayback Machine's viewer frame includes jQuery 1.3.2 |
08:25
π
|
yipdw |
many sites make use of things like $(...).delegate() |
08:25
π
|
yipdw |
which is not supported in jQuery 1.3.2 |
08:25
π
|
yipdw |
and actually, Wayback's viewer is not an iframe |
08:25
π
|
chronomex |
HMMMM. |
08:25
π
|
yipdw |
so there's a jQuery conflict |
08:26
π
|
yipdw |
I just noticed this while viewing Proust grabs |
08:26
π
|
yipdw |
not sure how to fix it, aside from filing it as a Wayback Machine bug |
08:28
π
|
yipdw |
also, does it make sense to archive things like http://www.google-analytics.com/ga.js if they're included in a page? |
08:29
π
|
yipdw |
I mean, on the one hand it IS part of the page |
08:29
π
|
yipdw |
but on the other: |
08:29
π
|
yipdw |
(1) Google Analytics is fucking evil |
08:29
π
|
yipdw |
(2) there is no way to block GA served via archive.org |
08:29
π
|
yipdw |
without collateral damage to other scripts served from archive.org (or any Wayback server) |
08:30
π
|
chronomex |
why is GA evil? |
08:31
π
|
yipdw |
my perspective |
08:31
π
|
chronomex |
sure, what makes you think so? |
08:31
π
|
yipdw |
it reminds me too much of Deus Ex's Daedalus |
08:32
π
|
yipdw |
and other such panopticons |
08:32
π
|
chronomex |
I'm not familiar with this Daedalus, care to explain? |
08:32
π
|
yipdw |
Daedalus, in Deus Ex, was a distributed system that reported on all communications to the Illuminati |
08:32
π
|
chronomex |
ah |
08:32
π
|
yipdw |
I never said it was a rational dislike :P |
08:33
π
|
yipdw |
but I think point (2) is the more important one anyway |
08:33
π
|
chronomex |
did I say anything about you being irrational? |
08:33
π
|
chronomex |
it seems like you are insecure in your beliefs |
08:33
π
|
yipdw |
no, just throwing that out there |
08:33
π
|
chronomex |
if you believe it, own it. don't half-believe. |
08:33
π
|
yipdw |
namely, that if someone has GA block rules set up via noscript, then this sort of thing is a runaround |
08:34
π
|
chronomex |
that's true. perhaps you should email info@archive.org. |
08:34
π
|
chronomex |
but consider this |
08:34
π
|
chronomex |
wayback machine scraper saves everything including the HTTP headers. why would they alter the page itself? |
08:34
π
|
yipdw |
well, the scraper doesn't alter the page |
08:34
π
|
chronomex |
correct |
08:35
π
|
chronomex |
I bet you could gin up some sort of noscript rules |
08:35
π
|
yipdw |
I'm going to suck in all external javascripts, too |
08:35
π
|
chronomex |
(I know nothing about noscript) |
08:35
π
|
yipdw |
but |
08:35
π
|
yipdw |
when the page is presented via the Wayback Machine, CSS and Javascript references are rewritten |
08:35
π
|
chronomex |
mandang, fucking amazon charging me sales tax |
08:36
π
|
chronomex |
it's not like I live in SEATTLE or anything |
08:36
π
|
Coderjoe |
I also do not like GA, simply from the panopticon-like side. |
08:36
π
|
chronomex |
oh wait, I do. I can see their old HQ from my house. |
08:36
π
|
Coderjoe |
you think they just track you on one site? HA. they track you everywhere you go that uses GA |
08:36
π
|
chronomex |
I like GA, and I wish it did more. |
08:37
π
|
chronomex |
though I don't really use it much, I tend to lean harder on getclicky.com |
08:37
π
|
yipdw |
oh, different audiences there |
08:37
π
|
chronomex |
:P |
08:37
π
|
yipdw |
if I had to use GA for, well, analytics, yeah, I'd love it |
08:37
π
|
yipdw |
but that's because in that case I Have The Power |
08:37
π
|
yipdw |
:P |
08:38
π
|
chronomex |
web's quite different when you're on the other side of things |
08:38
π
|
Coderjoe |
why not just use a log analysis tool to get the stats? |
08:38
π
|
yipdw |
yes |
08:38
π
|
chronomex |
changes your perspective rather forcibly |
08:38
π
|
yipdw |
I've abused that power quite often |
08:38
π
|
yipdw |
like changing images on my site that get lots of external references |
08:38
π
|
chronomex |
Coderjoe: will that get you screen resolution or amount of time the user kept the page open before leaving? |
08:39
π
|
Coderjoe |
the guy running rosettacode.org has noticed that GA only shows around 50% or so of his traffic (comparing to log files) |
08:39
π
|
chronomex |
GA samples stochastically when you get above some threshold |
08:39
π
|
Coderjoe |
yipdw: ever given the hello.jpg to those bandwidth thieves? |
08:40
π
|
yipdw |
Coderjoe: no, but I advise you not to browse animemusicvideos.org/forum at work anymore |
08:40
π
|
chronomex |
hahahaha |
08:40
π
|
yipdw |
there are several replacements of facepalm-256.gif with penises |
08:40
π
|
Coderjoe |
O_o |
08:40
π
|
chronomex |
excellent |
08:40
π
|
Coderjoe |
yay |
08:41
π
|
chronomex |
world needs more penises |
08:41
π
|
chronomex |
3.5 billion isn't enough |
08:44
π
|
chronomex |
not by a long shot |
08:44
π
|
yipdw |
there must be more than 3.5 billion penises in the world |
08:44
π
|
chronomex |
it's about the right order of magnitude |
08:45
π
|
yipdw |
what about all those insects |
08:46
π
|
chronomex |
oh, I was just counting humans |
08:46
π
|
yipdw |
but, uh |
08:46
π
|
yipdw |
dicks aside |
08:47
π
|
yipdw |
wayback rewrites script src URLs from e.g. http://platform.twitter.com/widgets.js to something like http://192.168.122.246:8080/20120106083222js_/http://platform.twitter.com/widgets.js |
08:47
π
|
yipdw |
makes it tricky |
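The rewriting shown above can be sketched as a pair of functions, assuming the replay URL always has the shape `<server root>/<14-digit timestamp><optional two-letter flag and underscore>/<original URL>` seen in that one example. The names `wrap`, `unwrap`, and `original_host` are hypothetical, and the pattern is inferred from the example only, not from Wayback's documentation.

```python
import re
from urllib.parse import urlsplit

# Assumed shape, taken from the single example URL in the log.
WAYBACK_RE = re.compile(
    r"^(?P<prefix>https?://[^/]+/)(?P<ts>\d{14})(?P<flag>[a-z]{2}_)?/(?P<url>.+)$"
)


def wrap(prefix, timestamp, url, flag=""):
    """Build a replay URL from a server prefix, a timestamp and the original URL."""
    return f"{prefix}{timestamp}{flag}/{url}"


def unwrap(url):
    """Return the original URL if `url` looks like a replay URL, else `url` itself."""
    match = WAYBACK_RE.match(url)
    return match.group("url") if match else url


def original_host(url):
    """Host of the original resource, whether or not the URL was rewritten."""
    return urlsplit(unwrap(url)).hostname


rewritten = wrap(
    "http://192.168.122.246:8080/",
    "20120106083222",
    "http://platform.twitter.com/widgets.js",
    flag="js_",
)
print(rewritten)
print(original_host(rewritten))
```

A blocker that matched on `original_host` rather than on the serving host could, in principle, still filter a script like ga.js without hitting every other script served from the same Wayback server.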
08:48
π
|
yipdw |
I'll grab them all; proust seems to break in weird ways if some externals aren't present |
11:36
π
|
ersi |
Mmmh, DOS games |
18:43
π
|
yipdw |
huh, what the fuck |
18:43
π
|
yipdw |
FileUtils#mkdir_p is raising Errno::EEXIST |
18:43
π
|
yipdw |
that...should not happen |
18:44
π
|
yipdw |
oH! |
18:44
π
|
yipdw |
because it's a dead symlink |
18:44
π
|
yipdw |
oops |
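The same trap can be reproduced outside Ruby. This is a Python analogue, not the Ruby code from the log: `os.makedirs(..., exist_ok=True)` also fails when the target name is a symlink whose destination is gone, because the name exists but is not a directory. `describe` and `ensure_dir` are hypothetical helpers, and replacing the dead link is just one possible way to handle it.

```python
import os
import tempfile


def describe(path):
    """Classify a path, telling dangling symlinks apart from missing paths."""
    if os.path.islink(path) and not os.path.exists(path):
        return "dangling symlink"
    if os.path.isdir(path):
        return "directory"
    if os.path.exists(path):
        return "other"
    return "missing"


def ensure_dir(path):
    """Like os.makedirs(exist_ok=True), but replace a dangling symlink first."""
    if describe(path) == "dangling symlink":
        os.unlink(path)
    os.makedirs(path, exist_ok=True)


with tempfile.TemporaryDirectory() as root:
    link = os.path.join(root, "data")
    os.symlink(os.path.join(root, "gone"), link)  # target never exists

    try:
        os.makedirs(link, exist_ok=True)
    except FileExistsError:
        print("makedirs failed on a", describe(link))

    ensure_dir(link)
    print("after ensure_dir:", describe(link))
```

Note that `os.path.exists` follows symlinks, so a check based on it alone reports the path as absent right before the mkdir call complains that it exists.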
19:02
π
|
yipdw |
also, if anyone here is running a Redis instance, I'd appreciate help running the discovery scripts in https://github.com/ArchiveTeam/proust-pulling |
19:02
π
|
yipdw |
it seems the best shot for converging on a full set of users is to just run those over and over |
19:13
π
|
yipdw |
wait, never mind, I'm a moron |
19:13
π
|
yipdw |
I jsut ralized proust publishes a sitemap that includes all public stories |
19:13
π
|
yipdw |
and that I cant' type |
19:20
π
|
Coderjoe |
I mentioned the sitemap, but I think it hasn't been updated in a while |
19:20
π
|
Coderjoe |
there were some new stories on the people page that I don't think were in the sitemap |
19:31
π
|
QUICKSTRE |
FILM STREAMING! http://quickstream.altervista.org/ |
19:31
π
|
nitro2k01 |
F! F! S! |
19:31
π
|
yipdw |
Coderjoe: yeah, but I can combine it with the existing sources |
19:32
π
|
Coderjoe |
yipdw: yeah, that's what I was expecting would happen. at least pulling people from it. |
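Combining a possibly stale sitemap with other discovery sources amounts to a set union. A minimal sketch, assuming the sitemap follows the standard sitemaps.org 0.9 schema (not verified for this site); the function names and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def merge_sources(*sources):
    """Union several iterables of URLs, dropping duplicates."""
    merged = set()
    for source in sources:
        merged.update(source)
    return merged


# Made-up sitemap and made-up results from other discovery scripts.
example_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/story/1</loc></url>
  <url><loc>http://example.com/story/2</loc></url>
</urlset>"""
already_found = ["http://example.com/story/2", "http://example.com/story/3"]

todo = merge_sources(sitemap_urls(example_sitemap), already_found)
print(sorted(todo))
```

Stories that appear only on the people page would still come in through the other sources, so an out-of-date sitemap costs nothing in the union.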
19:32
π
|
yipdw |
proust's robots.txt also disallows /js and /css |
19:32
π
|
yipdw |
I'm debating whether or not to abide by that |
19:32
π
|
yipdw |
standard protocol here is, I think, "fuck it" |
19:33
π
|
yipdw |
but those are pretty important to having a useful copy |
19:33
π
|
nitro2k01 |
If you're "lawful good", just download those files manually. Can't be many of them? |
19:33
π
|
nitro2k01 |
Not that it would matter right? |
19:33
π
|
yipdw |
that would work if I was downloading the whole site in one go |
19:33
π
|
yipdw |
which I do have a script for |
19:34
π
|
yipdw |
but there's no real way to do that on a per-user basis that doesn't involve a robot |
19:34
π
|
nitro2k01 |
Mirroring a Yahoo Group worked a charm for an open group |
19:34
π
|
yipdw |
it's more like: on one hand, no, the CSS and JS are not important as far as the content is concerned |
19:34
π
|
nitro2k01 |
Still need to log in for a members-only one |
19:35
π
|
nitro2k01 |
(Using an ugly wget hack) |
19:35
π
|
yipdw |
on the other hand humans will be looking at at least part of this dataset and presentation then matters quite a bit |
20:21
π
|
balrog_ |
hey all |
20:21
π
|
balrog_ |
SketchCow: you there? |
23:16
π
|
salvo |
!list |
23:38
π
|
chronomex |
we should have an automatic kick on !list |
23:38
π
|
chronomex |
or maybe an autoresponder with our list of torrents |
23:56
π
|
dnova |
the.GIMP.ultimate.inc.keygen.DiViNiTy |
23:59
π
|
PatC |
Anyone know if 28c3 is up on archive.org yet? |