| Time |
Nickname |
Message |
|
00:52
π
|
yipdw |
oh nice |
|
00:52
π
|
yipdw |
http://archive-access.sourceforge.net/projects/wayback/hadoop.html |
|
00:52
π
|
yipdw |
it looks like that's the WARC ingester/browser I've been looking for |
|
00:53
π
|
* |
yipdw gives it a shot |
|
00:56
π
|
chronomex |
don't shoot! |
|
00:58
π
|
underscor |
<Frooxius> It's one of my catchphrases |
|
00:58
π
|
underscor |
<Frooxius> it's missing a tryphrase though |
|
00:58
π
|
yipdw |
that was lamebda |
|
00:58
π
|
* |
underscor cries |
|
00:59
π
|
yipdw |
i can't come up with a continuation to that though |
|
01:07
π
|
yipdw |
ugh |
|
01:07
π
|
yipdw |
maybe I'm just kinda sleepy and/or stupid but I cannot figure out how to upload a WARC file to wayback |
|
01:08
π
|
yipdw |
the administrator manual is quite dense too |
|
01:09
π
|
yipdw |
oh, ugh, I have to write XML |
|
01:10
π
|
underscor |
yipdw: I find it easier to just extract the contents with warctool |
|
01:10
π
|
yipdw |
underscor: yeah, but I can't easily see what I got that way |
|
01:10
π
|
yipdw |
flat files are a terrible way to check hypertext |
|
01:10
π
|
underscor |
You can see which files, but yeah |
|
01:10
π
|
yipdw |
https://github.com/openplanets/wap/tree/master/ArchiveExplorer looks like it's what I want |
|
01:11
π
|
yipdw |
...sort of |
|
01:14
π
|
yipdw |
ha, damn, proust uses a lot of separate image files |
|
01:15
π
|
yipdw |
I'm surprised that they didn't use one of the many many solutions out there for CSS spriting |
|
01:15
π
|
yipdw |
oh wait no |
|
01:16
π
|
yipdw |
this ArchiveExplorer is garbage |
|
01:16
π
|
yipdw |
ok, this is what I want |
|
01:16
π
|
yipdw |
I want a tool that takes as input the path to a *.warc.gz file |
|
01:16
π
|
yipdw |
starts up a webserver |
|
01:16
π
|
yipdw |
and then mounts the warc.gz at / |
|
01:16
π
|
yipdw |
you can then access the started webserver via any web browser. |
|
01:16
π
|
yipdw |
how hard can it be? |
|
01:18
π
|
yipdw |
let's find out |
|
01:19
π
|
yipdw |
time to yak shave |
|
01:21
π
|
yipdw |
underscor: what's warctool? |
|
01:21
π
|
yipdw |
I don't see it in the hanzo repo |
|
01:22
π
|
yipdw |
it just occurred to me that if that can recreate some sort of directory structure, then I can get pretty far by just running thin on that |
|
01:23
π
|
chronomex |
yipdw: wouldn't you have to decompress the .gz ? I'm pretty sure gzip is not seekable by default |
|
01:23
π
|
yipdw |
chronomex: not sure of the implementation details yet; haven't actually tried building this |
|
01:23
π
|
yipdw |
i want to see if I can take the lazy way out by decompressing and extracting contents |
|
01:23
π
|
chronomex |
probably |
|
01:24
π
|
yipdw |
but the important part for me is being able to verify the archive in a browser |
|
01:34
π
|
yipdw |
actually, just running thin on the non-WARC bits seems to work reasonably ok |
|
01:34
π
|
yipdw |
it doesn't load all images but I can verify that those exist |
|
01:35
π
|
underscor |
There's something special about warc.gz files, where you can seek them and it just works™ |
|
01:35
π
|
underscor |
But I think that only applies to Heritrix warcs |
|
01:35
π
|
underscor |
yipdw: http://warc-tools.googlecode.com/svn/trunk/app/python/ |
|
01:36
π
|
underscor |
I think warcdump.py does what you want |
|
01:37
π
|
yipdw |
oh, I thought that repo was deprecated |
|
01:37
π
|
yipdw |
well then |
|
01:37
π
|
yipdw |
I'll give that a try |
|
01:37
π
|
yipdw |
but first, time to cook stuff |
|
01:37
π
|
underscor |
it worked for me |
|
01:37
π
|
underscor |
but that was in july |
|
01:38
π
|
underscor |
so the wget-warc spec might have changed |
|
02:10
π
|
yipdw |
mmm, sockeye salmon and quinoa |
|
04:47
π
|
SketchCow |
Hey, guys. |
|
04:47
π
|
SketchCow |
Oh my god, what an amazing hotel and location. |
|
04:53
π
|
Coderjoe |
underscor: that should work for wget-warc files as well. it has the code to put the offset or length or whatever in the gzip extrainfo field |
|
04:59
π
|
Coderjoe |
it is done in a rather hackish way, too, IMO |
|
05:28
π
|
ce |
ciao a tutti |
|
05:28
π
|
chronomex |
ciao |
|
05:29
π
|
ce |
buon anno |
|
05:33
π
|
yipdw |
argh |
|
05:33
π
|
yipdw |
if anyone can actually build this -> https://github.com/openplanets/wap/tree/master/ArchiveFS |
|
05:33
π
|
yipdw |
let me know |
|
05:34
π
|
yipdw |
I hate it when code that is supposedly up-to-date depends on dependencies that are five or six years old |
|
05:34
π
|
yipdw |
it means that that code is actually not up-to-date |
|
05:34
π
|
chronomex |
ce: bit late for that, eh? :) |
|
05:46
π
|
yipdw |
oh god, guys, really |
|
05:46
π
|
yipdw |
the authors of this Maven file made it download a hundred or so dependencies |
|
05:46
π
|
yipdw |
everything BUT a Fuse-J build |
|
05:47
π
|
yipdw |
so they included automatic resolution of all the trivial dependencies |
|
05:55
π
|
yipdw |
man, the more I look for a way to inspect WARCs, the more I am convinced that nobody actually has a way to do it |
|
05:55
π
|
yipdw |
this is a bit unsettling |
|
05:57
π
|
yipdw |
or, more precisely, many people have written their own ways to do it, but those ways are crazy ad-hoc and require a very specific environment |
|
07:01
π
|
yipdw |
ah! |
|
07:01
π
|
yipdw |
I got wayback up and running |
|
07:02
π
|
yipdw |
let's see what happens if I shove some WARCs at it |
|
07:34
π
|
yipdw |
wholy fuck |
|
07:34
π
|
yipdw |
it works |
|
07:34
π
|
chronomex |
o_o |
|
07:34
π
|
chronomex |
rad |
|
07:35
π
|
yipdw |
http://depot.ninjawedding.org/wayback.png |
|
07:36
π
|
yipdw |
this is rad |
|
07:36
π
|
yipdw |
"Think Proust is neat? " |
|
07:36
π
|
yipdw |
no |
|
07:36
π
|
yipdw |
think wayback is neat, yes |
|
07:36
π
|
yipdw |
at least once you get it working |
|
07:41
π
|
yipdw |
hmm, that's weird. I have resources in the WARC under proust.com/i/a22/... that are referenced from proust.com/story |
|
07:41
π
|
yipdw |
but the Wayback Machine can't find them |
|
07:41
π
|
yipdw |
dunno what that means |
|
07:42
π
|
yipdw |
oh, they're not actually in the WARC |
|
07:42
π
|
yipdw |
ok |
|
08:25
π
|
yipdw |
hmm, here's a stumper re: Javascript and archival of pages |
|
08:25
π
|
yipdw |
the Wayback Machine's viewer frame includes jQuery 1.3.2 |
|
08:25
π
|
yipdw |
many sites make use of things like $(...).delegate() |
|
08:25
π
|
yipdw |
which is not supported in jQuery 1.3.2 |
|
08:25
π
|
yipdw |
and actually, Wayback's viewer is not an iframe |
|
08:25
π
|
chronomex |
HMMMM. |
|
08:25
π
|
yipdw |
so there's a jQuery conflict |
|
08:26
π
|
yipdw |
I just noticed this while viewing Proust grabs |
|
08:26
π
|
yipdw |
not sure how to fix it, aside from filing it as a Wayback Machine bug |
|
08:28
π
|
yipdw |
also, does it make sense to archive things like http://www.google-analytics.com/ga.js if they're included in a page? |
|
08:29
π
|
yipdw |
I mean, on the one hand it IS part of the page |
|
08:29
π
|
yipdw |
but on the other: |
|
08:29
π
|
yipdw |
(1) Google Analytics is fucking evil |
|
08:29
π
|
yipdw |
(2) there is no way to block GA served via archive.org |
|
08:29
π
|
yipdw |
without collateral damage to other scripts served from archive.org (or any Wayback server) |
|
08:30
π
|
chronomex |
why is GA evil? |
|
08:31
π
|
yipdw |
my perspective |
|
08:31
π
|
chronomex |
sure, what makes you think so? |
|
08:31
π
|
yipdw |
it reminds me too much of Deus Ex's Daedalus |
|
08:32
π
|
yipdw |
and other such panopticons |
|
08:32
π
|
chronomex |
I'm not familiar with this Daedalus, care to explain? |
|
08:32
π
|
yipdw |
Daedalus, in Deus Ex, was a distributed system that reported on all communications to the Illuminati |
|
08:32
π
|
chronomex |
ah |
|
08:32
π
|
yipdw |
I never said it was a rational dislike :P |
|
08:33
π
|
yipdw |
but I think point (2) is the more important one anyway |
|
08:33
π
|
chronomex |
did I say anything about you being irrational? |
|
08:33
π
|
chronomex |
it seems like you are insecure in your beliefs |
|
08:33
π
|
yipdw |
no, just throwing that out there |
|
08:33
π
|
chronomex |
if you believe it, own it. don't half-believe. |
|
08:33
π
|
yipdw |
namely, that if someone has GA block rules set up via noscript, then this sort of thing is a runaround |
|
08:34
π
|
chronomex |
that's true. perhaps you should email info@archive.org. |
|
08:34
π
|
chronomex |
but consider this |
|
08:34
π
|
chronomex |
wayback machine scraper saves everything including the HTTP headers. why would they alter the page itself? |
|
08:34
π
|
yipdw |
well, the scraper doesn't alter the page |
|
08:34
π
|
chronomex |
correct |
|
08:35
π
|
chronomex |
I bet you could gin up some sort of noscript rules |
|
08:35
π
|
yipdw |
I'm going to suck in all external javascripts, too |
|
08:35
π
|
chronomex |
(I know nothing about noscript) |
|
08:35
π
|
yipdw |
but |
|
08:35
π
|
yipdw |
when the page is presented via the Wayback Machine, CSS and Javascript references are rewritten |
|
08:35
π
|
chronomex |
mandang, fucking amazon charging me sales tax |
|
08:36
π
|
chronomex |
it's not like I live in SEATTLE or anything |
|
08:36
π
|
Coderjoe |
I also do not like GA, simply from the panopticon-like side. |
|
08:36
π
|
chronomex |
oh wait, I do. I can see their old HQ from my house. |
|
08:36
π
|
Coderjoe |
you think they just track you on one site? HA. they track you everywhere you go that uses GA |
|
08:36
π
|
chronomex |
I like GA, and I wish it did more. |
|
08:37
π
|
chronomex |
though I don't really use it much, I tend to lean harder on getclicky.com |
|
08:37
π
|
yipdw |
oh, different audiences there |
|
08:37
π
|
chronomex |
:P |
|
08:37
π
|
yipdw |
if I had to use GA for, well, analytics, yeah, I'd love it |
|
08:37
π
|
yipdw |
but that's because in that case I Have The Power |
|
08:37
π
|
yipdw |
:P |
|
08:38
π
|
chronomex |
web's quite different when you're on the other side of things |
|
08:38
π
|
Coderjoe |
why not just use a log analysis tool to get the stats? |
|
08:38
π
|
yipdw |
yes |
|
08:38
π
|
chronomex |
changes your perspective rather forcibly |
|
08:38
π
|
yipdw |
I've abused that power quite often |
|
08:38
π
|
yipdw |
like changing images on my site that get lots of external references |
|
08:38
π
|
chronomex |
Coderjoe: will that get you screen resolution or amount of time the user kept the page open before leaving? |
|
08:39
π
|
Coderjoe |
the guy running rosettacode.org has noticed that GA only shows around 50% or so of his traffic (comparing to log files) |
|
08:39
π
|
chronomex |
GA samples stochastically when you get above some threshold |
|
08:39
π
|
Coderjoe |
yipdw: ever given the hello.jpg to those bandwidth thieves? |
|
08:40
π
|
yipdw |
Coderjoe: no, but I advise you not to browse animemusicvideos.org/forum at work anymore |
|
08:40
π
|
chronomex |
hahahaha |
|
08:40
π
|
yipdw |
there are several replacements of facepalm-256.gif with penises |
|
08:40
π
|
Coderjoe |
O_o |
|
08:40
π
|
chronomex |
excellent |
|
08:40
π
|
Coderjoe |
yay |
|
08:41
π
|
chronomex |
world needs more penises |
|
08:41
π
|
chronomex |
3.5 billion isn't enough |
|
08:44
π
|
chronomex |
not by a long shot |
|
08:44
π
|
yipdw |
there must be more than 3.5 billion penises in the world |
|
08:44
π
|
chronomex |
it's about the right order of magnitude |
|
08:45
π
|
yipdw |
what about all those insects |
|
08:46
π
|
chronomex |
oh, I was just counting humans |
|
08:46
π
|
yipdw |
but, uh |
|
08:46
π
|
yipdw |
dicks aside |
|
08:47
π
|
yipdw |
wayback rewrites script src URLs from e.g. http://platform.twitter.com/widgets.js to something like http://192.168.122.246:8080/20120106083222js_/http://platform.twitter.com/widgets.js |
|
08:47
π
|
yipdw |
makes it tricky |
|
08:48
π
|
yipdw |
I'll grab them all; proust seems to break in weird ways if some externals aren't present |
|
11:36
π
|
ersi |
Mmmh, DOS games |
|
18:43
π
|
yipdw |
huh, what the fuck |
|
18:43
π
|
yipdw |
FileUtils#mkdir_p is returning Errno::EEXIST |
|
18:43
π
|
yipdw |
that...should not happen |
|
18:44
π
|
yipdw |
oh! |
|
18:44
π
|
yipdw |
because it's a dead symlink |
|
18:44
π
|
yipdw |
oops |
|
19:02
π
|
yipdw |
also, if anyone here is running a Redis instance, I'd appreciate help running the discovery scripts in https://github.com/ArchiveTeam/proust-pulling |
|
19:02
π
|
yipdw |
it seems the best shot for converging on a full set of users is to just run those over and over |
|
19:13
π
|
yipdw |
wait, never mind, I'm a moron |
|
19:13
π
|
yipdw |
I jsut ralized proust publishes a sitemap that includes all public stories |
|
19:13
π
|
yipdw |
and that I cant' type |
|
19:20
π
|
Coderjoe |
I mentioned the sitemap, but I think it hasn't been updated in a while |
|
19:20
π
|
Coderjoe |
there were some new stories on the people page that I don't think were in the sitemap |
|
19:31
π
|
QUICKSTRE |
FILM STREAMING! http://quickstream.altervista.org/ |
|
19:31
π
|
nitro2k01 |
F! F! S! |
|
19:31
π
|
yipdw |
Coderjoe: yeah, but I can combine it with the existing sources |
|
19:32
π
|
Coderjoe |
yipdw: yeah, that's what I was expecting would happen. at least pulling people from it. |
|
19:32
π
|
yipdw |
proust's robots.txt also disallows /js and /css |
|
19:32
π
|
yipdw |
I'm debating whether or not to abide by that |
|
19:32
π
|
yipdw |
standard protocol here is, I think, "fuck it" |
|
19:33
π
|
yipdw |
but those are pretty important to having a useful copy |
|
19:33
π
|
nitro2k01 |
If you're "lawful good", just download those files manually. Can't be many of them? |
|
19:33
π
|
nitro2k01 |
Not that it would matter right? |
|
19:33
π
|
yipdw |
that would work if I was downloading the whole site in one go |
|
19:33
π
|
yipdw |
which I do have a script for |
|
19:34
π
|
yipdw |
but there's no real way to do that on a per-user basis that doesn't involve a robot |
|
19:34
π
|
nitro2k01 |
Mirroring a Yahoo Group worked a charm for an open group |
|
19:34
π
|
yipdw |
it's more like: on one hand, no, the CSS and JS are not important as far as the content is concerned |
|
19:34
π
|
nitro2k01 |
Still need to log in for a members-only one |
|
19:35
π
|
nitro2k01 |
(Using an ugly wget hack) |
|
19:35
π
|
yipdw |
on the other hand humans will be looking at at least part of this dataset and presentation then matters quite a bit |
|
20:21
π
|
balrog_ |
hey all |
|
20:21
π
|
balrog_ |
SketchCow: you there? |
|
23:16
π
|
salvo |
!list |
|
23:38
π
|
chronomex |
we should have an automatic kick on !list |
|
23:38
π
|
chronomex |
or maybe an autoresponder with our list of torrents |
|
23:56
π
|
dnova |
the.GIMP.ultimate.inc.keygen.DiViNiTy |
|
23:59
π
|
PatC |
Anyone know if 28c3 is up on archive.org yet? |
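
A note on the 01:23 and 01:35 exchange about whether a warc.gz can be seeked: a plain gzip stream is not seekable, but a file made of one gzip member per record can be read from any record boundary if the byte offsets are known. The sketch below is only an illustration of that idea, not how wget-warc, Heritrix, or Wayback implement it. The records are simplified placeholders rather than valid WARC records, the helper names `write_members` and `read_member` are hypothetical, and the offsets are kept in a plain list instead of the gzip extra field that Coderjoe mentions.

```python
import gzip
import io
import zlib

# Placeholder payloads standing in for WARC records (NOT valid WARC syntax).
RECORDS = [
    b"record-0: http://example.invalid/index.html\n",
    b"record-1: http://example.invalid/style.css\n",
    b"record-2: http://example.invalid/logo.png\n",
]


def write_members(records):
    """Compress each record as its own gzip member.

    Returns the concatenated bytes and the byte offset where each
    member starts, which serves as a minimal external index.
    """
    buf = io.BytesIO()
    offsets = []
    for rec in records:
        offsets.append(buf.tell())
        buf.write(gzip.compress(rec))
    return buf.getvalue(), offsets


def read_member(fileobj, offset, chunk_size=4096):
    """Seek to `offset` and decompress exactly one gzip member."""
    fileobj.seek(offset)
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    parts = []
    while not decomp.eof:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break  # truncated input; return what was decoded so far
        parts.append(decomp.decompress(chunk))
    return b"".join(parts)


blob, offsets = write_members(RECORDS)
archive = io.BytesIO(blob)

# Jump straight to the second record without touching the first.
second = read_member(archive, offsets[1])
print(second.decode())
```

The concatenated file is still an ordinary gzip file, so tools that decompress it from the start see all records back to back. The offset index is what makes random access cheap.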