Time |
Nickname |
Message |
00:03
🔗
|
db48x |
put a zfs filesystem image on the cd |
00:03
🔗
|
db48x |
problem solved |
00:08
🔗
|
alard |
SketchCow: desktop.google.com has arrived on batcave. |
00:09
🔗
|
SketchCow |
Thanks, alard. |
00:13
🔗
|
balrog |
anyone here messing with zfs for mac? |
00:13
🔗
|
balrog |
(the tenscomplement port) |
00:16
🔗
|
SketchCow |
I so don't trust zfs |
00:18
🔗
|
db48x |
SketchCow: oh? |
00:18
🔗
|
db48x |
it just puts your data into a merkle tree, which is super awesome |
01:19
🔗
|
DFJustin |
I'm a huge fan of the archive.org online reader, I wish there was a desktop version |
01:19
🔗
|
chronomex |
it relies on browser image scaling, which varies a lot and can be lame |
01:20
🔗
|
DFJustin |
true, that only seems to be an issue for 1-bit stuff though |
01:34
🔗
|
db48x |
merkle trees are like pixie dust; you basically can't go wrong |
01:47
🔗
|
bsmith093 |
is the steve meretsky archive up yet? |
02:05
🔗
|
closure |
"We believe that all of the Early Journal Content is out of copyright." -- JSTOR "Additional uses are allowed, including the ability to download, share, and reuse the content for any non-commercial purpose." -- JSTOR .. um, if it's out of copyright, who the fuck do they think they are slapping these restrictions on it? |
02:06
🔗
|
* |
chronomex shrugs |
02:06
🔗
|
chronomex |
well, they're not 100% sure it's 100% out of copyright |
02:31
🔗
|
godane |
i have a feeling backing up something like reddit will be a problem |
02:32
🔗
|
godane |
only cause images are linked to other sites |
02:33
🔗
|
godane |
so to archive reddit we would also have to archive the external links |
04:31
🔗
|
db48x2 |
comcast-- |
05:15
🔗
|
Wyatt |
Oh dear, what did they do this time? |
05:17
🔗
|
db48x2 |
left me offline for 12 hours, then couldn't explain why it just started working again while I was talking to support |
05:18
🔗
|
Wyatt |
Sounds like what we've come to expect from them. |
06:59
🔗
|
ersi |
5.0G www.instructables.com/ |
06:59
🔗
|
ersi |
Growin' and growin' |
07:00
🔗
|
Wyatt |
ersi: Does it work to just wget that? |
07:05
🔗
|
ersi |
Yeah |
07:06
🔗
|
ersi |
Or well, it *seems* to work. I'm going to check through what I get though |
07:08
🔗
|
ersi |
This is one massive site though, with mostly internal links |
07:11
🔗
|
Wyatt |
Hmm, think it would work for ehow? |
07:11
🔗
|
Wyatt |
Or is ehow already crawled by ia_archiver? |
07:14
🔗
|
ersi |
Wyatt: Doesn't seem crawled by ia_archiver at all when I visited http://liveweb.archive.org/www.ehow.com |
07:14
🔗
|
ersi |
neither was instructables btw ;) |
07:18
🔗
|
Wyatt |
Ominous. |
07:36
🔗
|
SketchCow |
Hey hey. |
07:37
🔗
|
SketchCow |
I finally game to negotiations with the developer set who found I was choking archive.org |
07:37
🔗
|
SketchCow |
So yay? |
07:38
🔗
|
db48x2 |
developer set? |
07:39
🔗
|
SketchCow |
Set of developers who were finding I was choking things. |
07:39
🔗
|
SketchCow |
To be honest, OCR is a bottleneck I don't like existing. |
07:39
🔗
|
SketchCow |
Add more OCRs |
07:39
🔗
|
SketchCow |
Everything else is going fine. |
07:40
🔗
|
SketchCow |
I'm getting into a useless twitter fight with some fathead |
07:40
🔗
|
db48x2 |
heh |
07:40
🔗
|
SketchCow |
I finally got the digitizer rig going |
07:41
🔗
|
SketchCow |
GDC tapes. I need to be digitizing at the rate of 15-20 a day. |
07:41
🔗
|
SketchCow |
One ends.... next one. |
07:41
🔗
|
SketchCow |
Just keep going |
07:41
🔗
|
SketchCow |
In middle of month, they send me money to buy a second one |
07:41
🔗
|
SketchCow |
It'll render. |
07:41
🔗
|
SketchCow |
And we'll kill these fuckers |
07:41
🔗
|
db48x2 |
sweet |
07:43
🔗
|
ersi |
buy a second what? |
07:43
🔗
|
ersi |
oh, digitizer rig |
07:45
🔗
|
db48x2 |
SketchCow: so the second question is "game"? |
07:48
🔗
|
SketchCow |
? |
07:48
🔗
|
db48x2 |
"<SketchCow> I finally game to negotiations..." |
07:49
🔗
|
db48x2 |
anyway |
07:50
🔗
|
SketchCow |
SAFE. So safe you wouldn't believe it. |
07:50
🔗
|
SketchCow |
root@teamarchive-0:/3/TIMAGS/super99# ~jscott/isitsafe |
07:50
🔗
|
ersi |
replace game with came, and it'll make more sense |
07:50
🔗
|
SketchCow |
Yes, I wrote a script that asks if the queue can handle me. |
07:51
🔗
|
db48x2 |
rsync to batcave finally started up again |
07:51
🔗
|
db48x2 |
SketchCow: lol |
07:51
🔗
|
db48x2 |
ersi: oh, I suppose if negotiations is an event |
07:51
🔗
|
db48x2 |
but then I would have expected "went" |
07:51
🔗
|
db48x2 |
anyway |
07:52
🔗
|
SketchCow |
http://www.archive.org/details/fox40newsaug222011 |
07:52
🔗
|
SketchCow |
Entertainment for you |
07:52
🔗
|
db48x2 |
ooh |
07:53
🔗
|
ersi |
Hm, wonder if I should have thrown on more parameters to wget before starting this :| |
07:53
🔗
|
db48x2 |
ersi: -D |
07:54
🔗
|
SketchCow |
* Closing connection #0 |
07:54
🔗
|
SketchCow |
< |
07:54
🔗
|
SketchCow |
< Connection: close |
07:54
🔗
|
SketchCow |
< Content-Length: 0 |
07:54
🔗
|
SketchCow |
< Content-Type: text/plain |
07:54
🔗
|
db48x2 |
ersi: --warc-file |
07:54
🔗
|
SketchCow |
root@teamarchive-0:/3/TIMAGS/super99# ~jscott/isitsafe |
07:54
🔗
|
SketchCow |
SAFE. So safe you wouldn't believe it. |
07:54
🔗
|
SketchCow |
Tah dah, it says I didn't break it! |
07:54
🔗
|
db48x2 |
heh |
07:55
🔗
|
ersi |
db48x2: So the answer is 'yes, I should have' |
07:55
🔗
|
db48x2 |
there's probably always another option you could throw on there |
07:55
🔗
|
ersi |
like -k? for converting the links |
07:55
🔗
|
db48x2 |
yes |
07:55
🔗
|
db48x2 |
and -K to save a copy of the original from before it munged the links |
07:56
🔗
|
ersi |
well, dang. |
07:56
🔗
|
db48x2 |
heh |
07:59
🔗
|
SketchCow |
At one point in this talk, Will Wright shows a self-riding motorcycle |
07:59
🔗
|
SketchCow |
It's hilarious |
07:59
🔗
|
SketchCow |
Running around a park scaring people |
07:59
🔗
|
db48x2 |
heh |
07:59
🔗
|
db48x2 |
he seems like a pretty crazy guy |
08:01
🔗
|
db48x2 |
does <META NAME='ROBOTS' CONTENT='NOARCHIVE'> work against wget even when you do -e robots=off? |
08:01
🔗
|
SketchCow |
Not sure |
08:02
🔗
|
db48x2 |
oh, interesting |
08:02
🔗
|
db48x2 |
this time it crashed |
08:03
🔗
|
SketchCow |
I don't know if I showed this script I run. |
08:03
🔗
|
db48x2 |
aha |
08:03
🔗
|
SketchCow |
root@teamarchive-0:/3/TIMAGS/smartprogrammer# ./ingestor SmartProgrammer_1984_02.pdf |
08:03
🔗
|
SketchCow |
OK, then, SmartProgrammer_1984_02.pdf gets the love. |
08:03
🔗
|
SketchCow |
Here's what I plan to do. |
08:03
🔗
|
db48x2 |
I was telling it to mirror fanfiction.net, but it redirects to www.fanfiction.net |
08:03
🔗
|
SketchCow |
In the collection named smart-programmer-newsletter... |
08:03
🔗
|
SketchCow |
I will add an item called smart-programmer-newsletter-1984-02. |
08:03
🔗
|
SketchCow |
I will say this dates to 1984-02. |
08:03
🔗
|
SketchCow |
I will give it the title of The Smart Programmer Newsletter (February 1984). |
08:04
🔗
|
SketchCow |
.. |
08:04
🔗
|
SketchCow |
It looked at SmartProgrammer_1984_02.pdf to figure it out. |
08:04
🔗
|
SketchCow |
That's test mode |
08:04
🔗
|
SketchCow |
It tells me it's working. |
08:04
🔗
|
db48x2 |
sweet |
08:04
🔗
|
SketchCow |
There are 18 issues. |
08:05
🔗
|
SketchCow |
Running. |
08:05
🔗
|
alard |
db48x2: I think wget doesn't listen to robots noarchive at all. It only understands nofollow. |
08:05
🔗
|
SketchCow |
It uploads each issue in roughly 8 seconds. |
08:05
🔗
|
db48x2 |
alard: good to know |
08:07
🔗
|
SketchCow |
Done. |
08:07
🔗
|
SketchCow |
18 issues in what, 2 minutes. |
08:08
🔗
|
db48x2 |
SketchCow: what do you use for downloading them? |
08:12
🔗
|
db48x2 |
doh |
08:12
🔗
|
SketchCow |
UNSAFE. Current OCR count is 207. |
08:12
🔗
|
SketchCow |
root@teamarchive-0:/3/TIMAGS# ~jscott/isitsafe |
08:13
🔗
|
db48x2 |
1am already |
08:13
🔗
|
SketchCow |
Oh no! |
08:13
🔗
|
db48x2 |
time to put more machines on the task of misreading the text in magazines |
08:20
🔗
|
SketchCow |
Yeah! |
08:25
🔗
|
* |
db48x2 is watching Time's Arrow |
08:37
🔗
|
kin37ik |
hullo |
08:38
🔗
|
ersi |
Hi |
08:38
🔗
|
SketchCow |
So, I want to throw Atari Force up there. |
08:38
🔗
|
SketchCow |
But Atari Force is a DC comic book |
08:38
🔗
|
SketchCow |
A super defunct one, but still |
08:39
🔗
|
SketchCow |
So as awesome as it is, I don't think it'll count right now. |
08:41
🔗
|
SketchCow |
But this? |
08:41
🔗
|
SketchCow |
http://www.bombjack.org/commodore/commodore/ |
08:41
🔗
|
SketchCow |
As soon as it finishes downloading, it goes up. |
08:41
🔗
|
SketchCow |
Fwip |
08:42
🔗
|
kin37ik |
woah |
08:54
🔗
|
josephwdy |
Michael S. Hart is dead .... |
08:55
🔗
|
Wyatt |
So how good is httrack for mirroring things really? |
08:56
🔗
|
josephwdy |
it's kinda shitty |
08:56
🔗
|
josephwdy |
good for small projects |
08:56
🔗
|
kin37ik |
crap, hit a snag with fortunecity.com |
08:57
🔗
|
Wyatt |
Really? Damn. |
08:58
🔗
|
Wyatt |
Funny, I had completely forgotten about fortunecity, too. |
08:58
🔗
|
josephwdy |
Nothing really good on windows for ripping a site, but if you're on linux wget or curl is really good. |
08:59
🔗
|
kin37ik |
ive been doing some poking around in it, and found their directory structure to be.....not quite as i expected on fortune city |
09:00
🔗
|
Wyatt |
josephwdy: Yeah, they're utilities useful in proportion to the length of their man pages. |
09:00
🔗
|
Wyatt |
But their man pages are...short story-length. |
09:03
🔗
|
Wyatt |
What options are good? looks like wget -mkKe robots=off --warc-file from just the past few bits of history |
09:03
🔗
|
db48x2 |
-E |
09:03
🔗
|
db48x2 |
--mirror |
09:04
🔗
|
db48x2 |
--wait |
09:04
🔗
|
db48x2 |
--random-wait |
09:04
🔗
|
db48x2 |
-p --protocol-directories -np --follow-ftp --progress=dot:decimal --warc-file --warc-cdx --warc-header --user-agent |
09:05
🔗
|
db48x2 |
the --warc options require a special build of wget which you'll find on the wiki |
09:05
🔗
|
kin37ik |
crap, now im stuck.... |
09:05
🔗
|
db48x2 |
they cause it to record an archive that contains not just the files retrieved, but the http request and response headers that lead to the files themselves |
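Note: the flags listed above could be combined roughly as follows. This is only a sketch, assuming a wget build that understands the --warc options; the function name mirror_site, the WGET override variable, and the "Mozilla/5.0" user agent are made up for illustration and are not from the discussion.

```shell
# Sketch only: mirror one site with the flags discussed above.
# Assumes a wget build that understands the --warc-* options.
# mirror_site, WGET and the user agent string are illustrative names/values.
mirror_site() {
    site="$1"   # start URL, e.g. http://www.example.com/
    name="$2"   # base name for the WARC file wget writes
    # -m mirror, -p page requisites, -np never ascend to the parent,
    # -E add .html where needed, -k convert links,
    # -K keep a copy of each file from before the links were converted
    ${WGET:-wget} -m -p -np -E -k -K -e robots=off \
        --wait=1 --random-wait \
        --warc-file="$name" --warc-cdx \
        --user-agent="Mozilla/5.0" \
        "$site"
}
```

Something like `mirror_site http://www.example.com/ example` would start the crawl; setting `WGET=echo` first prints the command instead of running it, which is handy for checking the flags.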
09:06
🔗
|
SketchCow |
OK, here we go. |
09:06
🔗
|
SketchCow |
Michael S. Hart is dead and we will miss him. |
09:06
🔗
|
SketchCow |
Only got to meet him once. |
09:07
🔗
|
chronomex |
kin37ik: recording headers as db48x2 recommends is the ideal; for some time we mirrored without doing it but now we do when possible |
09:07
🔗
|
kin37ik |
chronomex: dont you mean Wyatt, and not me? lol |
09:07
🔗
|
chronomex |
um, right. |
09:07
🔗
|
chronomex |
I'm not sober. |
09:07
🔗
|
kin37ik |
lol |
09:08
🔗
|
josephwdy |
SketchCow: that's pretty awesome :D do tell more. |
09:08
🔗
|
SketchCow |
DRUNKIVING |
09:08
🔗
|
chronomex |
DRUNKIRCING |
09:08
🔗
|
SketchCow |
DON'T DRINK AND DERIVE |
09:08
🔗
|
chronomex |
I actually don't know how to drive. |
09:08
🔗
|
Wyatt |
Drunk Relay Chat |
09:08
🔗
|
kin37ik |
lol |
09:09
🔗
|
SketchCow |
There it goes! |
09:09
🔗
|
SketchCow |
Adding 156 books |
09:09
🔗
|
josephwdy |
Wyatt: the at wiki has a good starting point http://archiveteam.org/index.php?title=Wget |
09:09
🔗
|
Wyatt |
Yeah, thanks. I was just looking over that. |
09:09
🔗
|
chronomex |
SketchCow: you still need to hook me up with your adder thing. |
09:09
🔗
|
SketchCow |
http://www.archive.org/details/commodore-manuals |
09:09
🔗
|
SketchCow |
chronomex: Yes |
09:10
🔗
|
Wyatt |
Sometimes I forget that there _are_ good resources for this stuff. |
09:11
🔗
|
ersi |
Ideally, wouldn't one want; A) 'just a plain wget' mirroring of the site, no modification B) modified links wget mirroring C) a WARC kind of wget mirroring= |
09:11
🔗
|
ersi |
s//=//?/ |
09:11
🔗
|
db48x2 |
ersi: yes |
09:11
🔗
|
SketchCow |
Ideally, you want both |
09:11
🔗
|
SketchCow |
But sometimes, no choice |
09:11
🔗
|
db48x2 |
-k and -K get you a modified and unmodified mirror |
09:11
🔗
|
chronomex |
both three? |
09:11
🔗
|
ersi |
Ah, true. |
09:11
🔗
|
db48x2 |
and the --warc gets you the archive |
09:11
🔗
|
SketchCow |
Shut up, drunky |
09:12
🔗
|
ersi |
does warc make 'archives'? |
09:12
🔗
|
db48x2 |
yea, after a fashion |
09:12
🔗
|
chronomex |
SketchCow: I'M drunk?!? |
09:12
🔗
|
ersi |
db48x2: Hm? |
09:12
🔗
|
db48x2 |
it's not a tarball |
09:12
🔗
|
ersi |
I mean, like heritrex (or whatever it's called) |
09:12
🔗
|
Wyatt |
You said there's a patch for warc on the wiki? |
09:12
🔗
|
db48x2 |
yea, very similar to what heritrix does |
09:12
🔗
|
ersi |
similar being compatible? |
09:13
🔗
|
db48x2 |
when they wrote heritrix they invented the arc format |
09:13
🔗
|
db48x2 |
it's been updated |
09:13
🔗
|
db48x2 |
I don't know the exact timeline |
09:13
🔗
|
ersi |
So it makes 'old version WARC archives'? |
09:13
🔗
|
SketchCow |
http://www.youtube.com/watch?v=xDjOr68VxKw |
09:13
🔗
|
SketchCow |
Go watch that |
09:14
🔗
|
db48x2 |
http://archiveteam.org/index.php?title=Wget_with_WARC_output |
09:14
🔗
|
kin37ik |
hmm, heres a problem, if i poke members.fortunecity.com, ill get all the dir files on that domain but wont get any of the members subsites as they arent linked, how could i get around that to poke the member accounts?? |
09:14
🔗
|
chronomex |
"sites [...] run by habitual whiners, will complain when a site scraping uses 200 megabytes of transfer when it could have used 100." -- sites run by whiners bitch at EVERYTHING |
09:14
🔗
|
Wyatt |
Truth^ |
09:14
🔗
|
db48x2 |
once you create your warc file, you should append a record that contains the script you ran to grab the site, if it's more than a single invocation of wget |
09:15
🔗
|
ersi |
Let me add; derp |
09:16
🔗
|
ersi |
oh, alard wrote the --warc wget support |
09:16
🔗
|
db48x2 |
yea |
09:16
🔗
|
chronomex |
you will note, alard has an @ by his name |
09:16
🔗
|
ersi |
Oh, the headers is probably used for Wayback Machine to place it in the timeline |
09:17
🔗
|
ersi |
Historically, I had a @ by my name as well. |
09:17
🔗
|
chronomex |
I think it's just for the masturbatory completeness factor |
09:17
🔗
|
ersi |
</careface> :P |
09:17
🔗
|
chronomex |
fine. |
09:17
🔗
|
chronomex |
usually the @s occupy 1 of 6 nickname columns on my screen; we're running low. |
09:17
🔗
|
chronomex |
ish. |
09:19
🔗
|
ersi |
Man, I'd like to just ./mirror-archive-the-fuck-out-of-url <url> |
09:20
🔗
|
SketchCow |
I think it's obvious we're going to have to write a script set that does this. |
09:20
🔗
|
db48x2 |
I've been working on one |
09:20
🔗
|
ersi |
also, darn these dynamic pages that generate these weird files |
09:20
🔗
|
chronomex |
ersi: weird how? |
09:20
🔗
|
ersi |
trololo?COMMENTS=UPSIDEDOWN?&SORT=INMYPANTS |
09:21
🔗
|
chronomex |
what's the problem with that? |
09:21
🔗
|
chronomex |
that's the bit after the last / in the url |
09:21
🔗
|
chronomex |
is filename. |
09:21
🔗
|
ersi |
None really, besides that it bothers me and feels naughty |
09:21
🔗
|
chronomex |
unix is okay with it, right? |
09:21
🔗
|
ersi |
Right. |
09:21
🔗
|
db48x2 |
you can use -E |
09:21
🔗
|
chronomex |
if it's okay with unix, it's okay with chronomex |
09:21
🔗
|
db48x2 |
it'll slap a .html on the end of all that |
09:22
🔗
|
ersi |
yeah, but I didn't do that :) |
09:22
🔗
|
ersi |
I'm unsure if I should CONTINUE RAPING or STOP and modify my parameters |
09:23
🔗
|
db48x2 |
indeed |
09:23
🔗
|
db48x2 |
a dilemma for the ages |
09:24
🔗
|
ersi |
If I let it run, i'll get a feel for if they use other domains for CDN or trickery and possibly total size of site |
09:25
🔗
|
chronomex |
this is instructables, right? |
09:26
🔗
|
ersi |
Yes. It's probably effing huge |
09:26
🔗
|
ersi |
It's up at 6GB currently |
09:27
🔗
|
chronomex |
ahyeah. |
09:28
🔗
|
chronomex |
god I hate it when people who are insane but kind of interesting email me |
09:29
🔗
|
Wyatt |
Seems like a generalised distributed parallel archival-quality...I hesitate to say "bandwidth fucker" because it's awfully uncouth. |
09:29
🔗
|
Wyatt |
But yes, challenging, but boy would it be useful. |
09:33
🔗
|
* |
kin37ik is getting frustrated |
09:34
🔗
|
ersi |
chronomex: Do you get lots of insane interesting people mailing you? :) |
09:35
🔗
|
chronomex |
no, for the most part, it's confused transsexual folk who think I care. |
09:35
🔗
|
chronomex |
responding to this one with "This sounds like something one would ask a lover. Before you proceed any further, ask yourself the following question: Is chronomex my lover?" |
09:35
🔗
|
kin37ik |
lol |
09:37
🔗
|
ersi |
Wyatt: My continuation of questionable quality archival effort? |
09:37
🔗
|
chronomex |
ersi: what would you change? |
09:38
🔗
|
SketchCow |
OK, who wants a short project |
09:38
🔗
|
Wyatt |
ersi: In a sense? I'm saying it would be nice to spread the love around |
09:38
🔗
|
SketchCow |
http://census.ire.org/ |
09:38
🔗
|
ersi |
Well, I'd throw on -kK and perhaps some more |
09:39
🔗
|
SketchCow |
Turn that into an "item", a collection that makes sense. |
09:39
🔗
|
SketchCow |
Module threw exception: |
09:39
🔗
|
SketchCow |
item must be OCR'd via auto_submit |
09:40
🔗
|
SketchCow |
That's interesting. |
09:40
🔗
|
ersi |
Wouldn't the "raw data datasets" from the bottom of http://census.ire.org/data/bulkdata.html be good candidates? |
09:40
🔗
|
chronomex |
SketchCow: how is this better than the data on census.gov? |
09:40
🔗
|
SketchCow |
I am not clear at all it is. |
09:40
🔗
|
chronomex |
I'm not seeing any real value add, except a shinier interface |
09:40
🔗
|
SketchCow |
If that's the case, I trust that opinion. |
09:41
🔗
|
chronomex |
I've spent a good deal of time working with census data; I practically majored in that shit. |
09:41
🔗
|
ersi |
:o |
09:41
🔗
|
chronomex |
geography is a lot to do with demography |
09:42
🔗
|
chronomex |
https://github.com/ireapps/census yeah, it's a fancy interface to census data |
09:43
🔗
|
chronomex |
ersi: -k is not archive-safe, unless combined with -K. |
09:43
🔗
|
ersi |
That's why I'd do -kK |
09:43
🔗
|
chronomex |
-K means some extra work to get an archive-safe version |
09:44
🔗
|
chronomex |
what other flags were you thinking of? |
09:45
🔗
|
ersi |
well, I'd consider building alard's patched wget version and do WARC perhaps |
09:45
🔗
|
chronomex |
warc is good |
09:45
🔗
|
chronomex |
can you combine --continue with --warc ? |
09:45
🔗
|
SketchCow |
http://www.archive.org/details/commodore-manuals |
09:45
🔗
|
SketchCow |
aww yeah! |
09:46
🔗
|
ersi |
maybe add some domains to -D |
09:46
🔗
|
chronomex |
SketchCow: color monitor service manual?!? fuck yeah! |
09:46
🔗
|
ersi |
Hm, maybe |
09:46
🔗
|
SketchCow |
See, these are all useful things |
09:46
🔗
|
ersi |
But I'd rather do a full blown new run with --warc |
09:46
🔗
|
SketchCow |
That have been around a long time |
09:47
🔗
|
SketchCow |
But they're going to be consolidated now. |
09:47
🔗
|
chronomex |
ersi: right, just wondering. remember, alcohol. |
09:47
🔗
|
ersi |
Also, change the useragent to Firefox or something instead of Googlebot |
09:47
🔗
|
ersi |
maybe I'm getting 'GBot customised' versions of pages :/ |
09:47
🔗
|
db48x2 |
yea, that helps a lot |
09:47
🔗
|
chronomex |
ersi: or "ARCHIVETEAM FUCKYOUBOT" |
09:48
🔗
|
SketchCow |
ArchiveTeam 1.0/Bitch I'm a Bus |
09:48
🔗
|
kin37ik |
SketchCow: might just grab a copy of all of those and store them away somewhere |
09:48
🔗
|
chronomex |
"ARCHIVETEAM FUCKYOUBOT 3.6" |
09:48
🔗
|
ersi |
currently running with; wget -m -c -p -e robots=off http://www.instructables.com/index --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" |
09:48
🔗
|
db48x2 |
to be honest we ought to archive with lots of different user agents, to make sure |
09:48
🔗
|
db48x2 |
ersi: --mirror |
09:48
🔗
|
ersi |
-m == --mirror |
09:48
🔗
|
chronomex |
db48x2: this sounds like "wget replacement project" to me |
09:48
🔗
|
db48x2 |
oh, right |
09:48
🔗
|
chronomex |
wget is great but it's not the ultimate spider. |
09:48
🔗
|
SketchCow |
We've moved in the last few months from panic downloads to proactives. |
09:48
🔗
|
ersi |
I like 'em short parameters |
09:49
🔗
|
SketchCow |
Proactives, I am fine with 5 400mb .tar.gz files, representing different approaches. |
09:49
🔗
|
ersi |
I really do hope AutoCAD will take great care of Instructables.. but.. Trust No One. |
09:49
🔗
|
SketchCow |
I just don't want to lose stuff that's time critical. |
09:49
🔗
|
ersi |
SketchCow: This bitch be huge though |
09:49
🔗
|
db48x2 |
size isn't an issue |
09:49
🔗
|
SketchCow |
My opinion, which I told Bre, is that AutoCAD will buy Makerbot within 4-5 years |
09:49
🔗
|
ersi |
It can complicate things :) |
09:49
🔗
|
chronomex |
SketchCow: that would be very interesting |
09:49
🔗
|
Wyatt |
Size is an issue when we've only got two weeks to get all of it. |
09:50
🔗
|
ersi |
SketchCow: Does not sound unlikely. Since they bought Instructables for the exactly same reason they would buy Makerbot |
09:50
🔗
|
chronomex |
SketchCow: my personal opinion? makerbot is in violation of its lease, which says "robots made must obey asimov's 3 laws". I've had my fingers burned by a makerbot. |
09:50
🔗
|
ersi |
lol |
09:50
🔗
|
Wyatt |
Was that the makerbot's fault? |
09:50
🔗
|
db48x2 |
yes |
09:50
🔗
|
SketchCow |
Yeah, seriously |
09:50
🔗
|
db48x2 |
it let him get injured |
09:50
🔗
|
chronomex |
yes, it went down when it ought have gone up because my fingers were there! |
09:51
🔗
|
SketchCow |
If I'm canoeing with you, and you're a fuck and fall over and drown |
09:51
🔗
|
SketchCow |
Which is within Jason Scott's Three Laws of Robotics |
09:51
🔗
|
SketchCow |
2. You die, I get your wallet |
09:51
🔗
|
SketchCow |
1. I didn't know him, officer |
09:51
🔗
|
chronomex |
wait wait wait |
09:51
🔗
|
SketchCow |
3. If our size is the same, hey, you died naked for whatever reason |
09:51
🔗
|
chronomex |
you're a robot? |
09:51
🔗
|
Wyatt |
I KNEW there was a reason I don't carry cash! And all this time, I thought it was roaming bands of thugs. |
09:52
🔗
|
SketchCow |
So we're using the DEFCON speech to apply to TED |
09:52
🔗
|
SketchCow |
The question is, can they get an adequate idea I could do a TED speech when half the words are profanity |
09:52
🔗
|
SketchCow |
We'll see!! |
09:53
🔗
|
ersi |
Oh fuck, that'd be great |
09:54
🔗
|
SketchCow |
Attend TEDActive 2012 in Palm Springs |
09:54
🔗
|
SketchCow |
Held in Palm Springs, TEDActive is a parallel event held at the same time as TED in Long Beach, featuring the simulcast of the conference. Get the benefits of the TED Book Club, conference video archives, online social networking, and many special offers (Learn more .). |
09:54
🔗
|
SketchCow |
Price: $3,750 |
09:54
🔗
|
SketchCow |
I wish I could afford TED |
09:55
🔗
|
SketchCow |
I already qualify as an insider |
09:55
🔗
|
Wyatt |
If you get in, you have to pull the "Fuck you, you are all in ArchiveTeam" bit. |
09:55
🔗
|
* |
db48x2 sighs |
09:55
🔗
|
db48x2 |
3am now |
09:55
🔗
|
SketchCow |
But I can't pay retail for that shit |
09:55
🔗
|
ersi |
They're expensive/costly as fuck |
09:55
🔗
|
SketchCow |
It was so great |
09:55
🔗
|
SketchCow |
I paid wholesale price |
09:55
🔗
|
SketchCow |
Still expensive |
09:55
🔗
|
Wyatt |
Bajeezus, though, that's worse than SXSW... |
09:55
🔗
|
SketchCow |
Worth every dime. |
09:55
🔗
|
SketchCow |
Every. Dime. |
09:55
🔗
|
ersi |
I did get to watch TED live for free last year |
09:55
🔗
|
SketchCow |
Retail is $7,500 |
09:56
🔗
|
ersi |
(I also RTMPDumped the shit out of the stream) |
09:56
🔗
|
SketchCow |
I harassed one of the google founders (Page) for 40 seconds. |
09:56
🔗
|
SketchCow |
Come on, that was worth it right there |
09:56
🔗
|
db48x2 |
http://pastebin.com/8EDZBLE0 |
09:56
🔗
|
chronomex |
hahahahaha SketchCow |
09:57
🔗
|
SketchCow |
I demanded he buy 4chan through a shell company |
09:57
🔗
|
SketchCow |
This was before canv.as of course. |
09:57
🔗
|
SketchCow |
Shook Bill Gates' hand, had a long talk with The Amazing Randi |
09:57
🔗
|
SketchCow |
Come on, so worth it |
09:58
🔗
|
SketchCow |
Also surprised by the people who knew me on sight |
09:58
🔗
|
SketchCow |
Like Wozniak |
09:58
🔗
|
SketchCow |
Anyway, I'm applying |
09:58
🔗
|
SketchCow |
With some help |
09:58
🔗
|
SketchCow |
If I get in, you'll see probably a 7 or 12 minute version of that speech |
09:59
🔗
|
db48x2 |
I've got another script that does a zfs snapshot |
09:59
🔗
|
chronomex |
db48x2: you want me to run that pastebin? |
09:59
🔗
|
db48x2 |
runs this script and then takes a zfs snapshot to preserve it |
09:59
🔗
|
db48x2 |
chronomex: this is just the script that I'm working on |
10:00
🔗
|
db48x2 |
you have to customize it per site, of course |
10:00
🔗
|
chronomex |
right. |
10:00
🔗
|
db48x2 |
for GoogleFriendsNewsletter: |
10:00
🔗
|
db48x2 |
grab -a log http://groups.google.com/group/google-friends/download?s=pages -O google-friends-pages.zip |
10:00
🔗
|
db48x2 |
mirror -a log "${SITE2}" |
10:00
🔗
|
db48x2 |
mirror -o log "${SITE}" |
10:00
🔗
|
db48x2 |
grab -a log http://groups.google.com/group/google-friends/download?s=files -O google-friends-files.zip |
10:00
🔗
|
db48x2 |
etc |
10:01
🔗
|
db48x2 |
so it's not really as simple as it ought to be, I guess |
10:01
🔗
|
db48x2 |
but I could make those command line args |
10:01
🔗
|
SketchCow |
http://www.guardian.co.uk/books/2011/sep/07/michael-moore-hated-man-america |
10:01
🔗
|
SketchCow |
makeup? makeup. |
10:02
🔗
|
db48x2 |
--mirror http://wherever/ --mirror http://another/ --grab http://some/file |
10:02
🔗
|
ersi |
Lol! Nice that Wozy recognized ya' :) |
10:08
🔗
|
SketchCow |
OK, bed |
10:09
🔗
|
Wyatt |
'Night |
10:09
🔗
|
db48x2 |
Time's Arrow is a pretty good episode |
10:09
🔗
|
db48x2 |
it's got everything |
10:10
🔗
|
ersi |
Time's Arrow? |
10:11
🔗
|
db48x2 |
severed heads, time travel, body snatchers, robots, historical figures |
10:11
🔗
|
db48x2 |
ersi: Star Trek: TNG episode |
10:11
🔗
|
ersi |
oh, heh |
10:11
🔗
|
db48x2 |
S05E26 and S06E01 |
10:12
🔗
|
db48x2 |
they find Data's 500-year-old severed head in a mine under San Francisco |
10:12
🔗
|
db48x2 |
hijinks ensue |
10:15
🔗
|
chronomex |
yeah that was kind of strange. |
10:16
🔗
|
kin37ik |
right, now that im off the phone i need to figure out this dir |
10:27
🔗
|
kin37ik |
how do i get Wget to fetch and grab the user/member subsites if they arent linked somewhere on fortunecity for Wget to follow? |
10:27
🔗
|
db48x2 |
you have to find out the usernames |
10:27
🔗
|
db48x2 |
feed them to wget |
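Note: "feed them to wget" might look something like the sketch below. It assumes the usernames are already collected one per line, and that accounts sit directly under members.fortunecity.com/<user>/; the names make_urls, usernames.txt and urls.txt are hypothetical.

```shell
# Sketch: turn a list of usernames (one per line on stdin) into member URLs.
# make_urls is a hypothetical helper; the URL layout is an assumption and
# will not cover accounts that live in nested directories.
make_urls() {
    while IFS= read -r user; do
        # skip blank lines in the input
        if [ -n "$user" ]; then
            printf 'http://members.fortunecity.com/%s/\n' "$user"
        fi
    done
}
```

The output can be saved with `make_urls < usernames.txt > urls.txt` and handed to wget with `wget -m -np -i urls.txt`, since -i reads start URLs from a file.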
10:27
🔗
|
kin37ik |
thats the problem, i need to fetch all the usernames, as far as ive worked out so far |
10:28
🔗
|
db48x2 |
yep |
10:28
🔗
|
kin37ik |
members.fortunecity.com contains all the member pages but none of those member pages are actually linked in the members.fortunecity.com/ directory |
10:33
🔗
|
kin37ik |
if you were to poke for all potential user accounts, how would you go about it? |
10:34
🔗
|
chronomex |
when we scraped geocities, we did google site: searches for all the words in the dictionary and pulled out the urls |
10:35
🔗
|
chronomex |
it's kind of icky but it works |
10:35
🔗
|
kin37ik |
hmm |
10:39
🔗
|
kin37ik |
i dont know how well that would work on fortunecity |
10:39
🔗
|
kin37ik |
that would probably hit well over half but then obtaining the rest |
10:39
🔗
|
chronomex |
how many do you have now? |
10:40
🔗
|
kin37ik |
at the moment, ive only hit a few user accounts, and then the directory structure just started getting a bit funky |
10:41
🔗
|
alard |
The wayback machine can also give you a list: http://wayback.archive.org/web/*/http://members.fortunecity.com/* |
10:41
🔗
|
chronomex |
so, half would be an improvement. |
10:41
🔗
|
alard |
(But that of course will only give you things that are already archived.) |
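Note: whichever source the URL list comes from (search results or the wayback listing), the account names still have to be pulled out of it. A rough sketch, assuming the list is plain text with one URL per line and that the account name is the first path component, which may not hold for every account; extract_users is a made-up name.

```shell
# Sketch: reduce a list of URLs (one per line on stdin) to unique account names.
# extract_users is a hypothetical helper. Only URLs on members.fortunecity.com
# are considered, and the first path component is assumed to be the account.
extract_users() {
    sed -n 's|^https\{0,1\}://members\.fortunecity\.com/\([^/?#][^/?#]*\)/.*|\1|p' | sort -u
}
```

For example, `extract_users < wayback-urls.txt > usernames.txt` (file names are placeholders) would leave one name per line, ready to be turned back into start URLs for a crawl.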
10:42
🔗
|
kin37ik |
alard: yes, but that still helps |
10:42
🔗
|
kin37ik |
they have a bit of a weird dir structure, not only do they keep the user accounts at something like, for example members.fortunecity.com/user0001/index.html |
10:43
🔗
|
kin37ik |
but they are also doing the dir in dir as well so like, members.fortunecity.com/millenium/baloons/1035/index.html sort of thing |
11:00
🔗
|
kin37ik |
ouch, that doesnt help.... |
11:13
🔗
|
kin37ik |
id better write all this down |
12:10
🔗
|
ersi |
http://feedproxy.google.com/~r/hackaday/LgoM/~3/rMV2Fqe2uao/ |
12:10
🔗
|
ersi |
oh fuck you google, mangling urls |
12:10
🔗
|
ersi |
http://hackaday.com/2011/09/08/recovering-data-for-a-homemade-cray/ * |
12:10
🔗
|
ersi |
Fentons cray recovery thingie majingy :) |
12:20
🔗
|
Soojin |
cool :) |
12:53
🔗
|
SpaceCore |
Afternoon |
12:53
🔗
|
* |
SpaceCore reads backlog |
12:54
🔗
|
SpaceCore |
ersi: need any help with that? |
13:01
🔗
|
ersi |
Hm? |
13:01
🔗
|
ersi |
with instructables? |
13:03
🔗
|
SpaceCore |
yeah |
13:05
🔗
|
ersi |
I dunno, I got a process running along nicely - hopefully it's useful data :P |
13:05
🔗
|
SpaceCore |
Ok |
13:05
🔗
|
* |
SpaceCore goes back to attempting to rebuild his netbook |
13:20
🔗
|
emijrp |
can we archive Michael S. Hart plox? |
13:40
🔗
|
emijrp |
Sep 6, 2011 - On Sep 3rd (just before the long labor day weekend), WebCite went down due to a hardware failure. While we are restoring the database from our backups, no new snapshots can be made, and old snapshots may be temporarily unavailable. We apologize for any inconvenience caused. |
13:41
🔗
|
emijrp |
http://www.webcitation.org/archive.php |
15:30
🔗
|
SketchCow |
Oh, web citation |
15:33
🔗
|
DFJustin |
<ersi> Wyatt: Doesn't seem crawled by ia_archiver at all when I visited http://liveweb.archive.org/www.ehow.com |
15:34
🔗
|
DFJustin |
you're doing it wrong |
15:34
🔗
|
DFJustin |
need an http:// before www.ehow.com |
15:45
🔗
|
SketchCow |
http://www.archive.org/details/commodore-manuals |
16:04
🔗
|
lowtekk |
cool, im sure youve already got any commodore manual I do, but i'll check |
16:07
🔗
|
SketchCow |
Well, if I DON'T, then yes, it would be good of the world to put that together. |
16:32
🔗
|
emijrp |
SketchCow: what is the status of jamendo downloading? |
16:34
🔗
|
SketchCow |
Stops and starts. |
16:34
🔗
|
SketchCow |
It times out and dies constantly. |
16:35
🔗
|
emijrp |
but you have to restart or it auto resumes? |
16:35
🔗
|
SketchCow |
I have to restart it, and I resume it by knowing when it last died. |
16:35
🔗
|
emijrp |
ok |
16:52
🔗
|
ersi |
DFJustin: yeah yeah, i wrote that manually here |
17:16
🔗
|
godane |
just download floss weekly 114 |
17:17
🔗
|
godane |
slowly getting old twit.tv show |
17:20
🔗
|
sep332 |
I heard a rumor that archiveteam is doing something with the Yahoo Video archive soon? Is that true? |
17:21
🔗
|
SketchCow |
I'm uploading it |
17:24
🔗
|
sep332 |
I have a 385GB slice of it, users# 1,300,000 - 1,400,000 |
17:24
🔗
|
sep332 |
do you have those already? |
17:24
🔗
|
SketchCow |
I want it. |
17:25
🔗
|
SketchCow |
I have to head out, but I am for it. |
17:25
🔗
|
sep332 |
OK, it will be about 6 hours before I can get to them, but I'll put them wherever you want. |
17:26
🔗
|
SketchCow |
Ok, mail jason@textfiles.com, I'll set up an rsync slot |
17:26
🔗
|
sep332 |
ok cool, thanks |
17:29
🔗
|
godane |
i hope this comes out in 5 years: http://en.wikipedia.org/wiki/Stacked_Volumetric_Optical_Disk |
17:30
🔗
|
godane |
one layer equals about 2.4TB |
17:30
🔗
|
godane |
and it can have 100x or more layers |
17:32
🔗
|
godane |
more likely a better optical disc than hvd or 5D dvd since the most these will save is 6tb to 10tb max |
17:34
🔗
|
sep332 |
I hope its not a disk-shape, I'm sick of discs |
17:34
🔗
|
sep332 |
how about a cube? or a nice hexagonal crystal |
17:34
🔗
|
godane |
only like for archive reasons |
17:34
🔗
|
closure |
pyramid power |
17:34
🔗
|
godane |
no write to the device |
17:35
🔗
|
godane |
just that the speed of the laser for SVOD will have to be very fast |
17:36
🔗
|
sep332 |
I think you can write with a holographic laser |
17:36
🔗
|
godane |
other wise it could take months just to burn it |
17:38
🔗
|
sep332 |
Kenwood had a 7-laser parallel CDROM reader back in 2001, http://hothardware.com/Reviews/Kenwoods-72X-True-X-CDROM-Drive/ |
17:38
🔗
|
sep332 |
I think we can do better :) |
18:15
🔗
|
Schbirid |
emijrp: does dumpgenerator.py do the actual downloading or does it generate a urllist? |
18:16
🔗
|
emijrp |
it downloads the text and images |
18:16
🔗
|
Schbirid |
nice |
18:16
🔗
|
Schbirid |
any idea what might be wrong if a wikia wiki is not in http://wiki-stats.wikia.com/ ? |
18:16
🔗
|
Schbirid |
i want to preserve quake.wikia.com |
18:17
🔗
|
emijrp |
wikia dumps are generated on demand |
18:18
🔗
|
emijrp |
you have to request it, but im not sure where |
18:18
🔗
|
Schbirid |
ah ok |
18:19
🔗
|
Schbirid |
the doom wiki was just forked to doomwiki.org |
18:19
🔗
|
Schbirid |
:) |
18:19
🔗
|
emijrp |
although you can try with dumpgenerator |
18:19
🔗
|
emijrp |
using http://quake.wikia.com/api.php |
18:19
🔗
|
emijrp |
i mean, it is better if wikia gives you the dump, but if you dont want to ask or wait, just use wikiteam tools |
18:20
🔗
|
Schbirid |
yeah |
18:20
🔗
|
Schbirid |
i shall try it :) |
18:20
🔗
|
Schbirid |
thanks! |
18:20
🔗
|
Schbirid |
we should totally sync our jamendo archives some day btw |
18:20
🔗
|
emijrp |
im downloading incrementally |
18:21
🔗
|
Schbirid |
me too |
18:21
🔗
|
emijrp |
SketchCow too on IA |
18:21
🔗
|
Schbirid |
jamendo? |
18:21
🔗
|
emijrp |
yes |
18:21
🔗
|
Schbirid |
oh wow |
18:21
🔗
|
emijrp |
mp3 and ogg |
18:21
🔗
|
Schbirid |
i am still waiting for them to show why removed albums were removed otherwise i would have started doing that |
18:21
🔗
|
Schbirid |
i was in contact with IA about it once |
18:22
🔗
|
Schbirid |
jamendo offers to sync albums to servers as community hosted mirrors but they require you to run some python stuff iirc |
18:22
🔗
|
Schbirid |
there is one such server but the guys are hard to contact |
18:40
🔗
|
swebb1 |
I forgot that I had this: http://badcheese.com/~steve/crawl/ |
18:43
🔗
|
emijrp |
I am an archivist, and what is this? |
18:45
🔗
|
db48x |
swebb1: nifty |
18:45
🔗
|
emijrp |
wget http://badcheese.com/~steve/crawl/crawling.flv |
18:46
🔗
|
Schbirid |
http://www.onlineuniversity.net/1996-vs-2011/ |
18:47
🔗
|
Schbirid |
crap sorry |
18:47
🔗
|
Schbirid |
infographic spam |
18:47
🔗
|
Schbirid |
go http://images.onlineuniversity.net.s3.amazonaws.com/96vs11.jpg instead |
18:47
🔗
|
emijrp |
1 petabyte = 74 terabytes? |
18:48
🔗
|
Schbirid |
yes, if you are a macfag and buy 1 petabyte you only get 74tb :) |
18:49
🔗
|
emijrp |
DRM included? |
18:51
🔗
|
Schbirid |
1050tb worth of it |
18:59
🔗
|
db48x |
I want a petabyte of storage in my apartment |
19:09
🔗
|
godane |
just need to get 250 4tb hard drives once those come out |
19:09
🔗
|
godane |
cause 333 3tb hard drives is a very odd number |
19:10
🔗
|
godane |
also saves space cause there will be fewer drives |
19:11
🔗
|
lowtekk |
start saving |
19:13
🔗
|
ersi |
stop shaving |
19:13
🔗
|
godane |
by the time you buy them there will be 8tb or 16tb drives |
19:14
🔗
|
godane |
bbl |
19:21
🔗
|
emijrp |
I have a petabyte in my PC. |
19:21
🔗
|
emijrp |
I signed up on Internet Archive. I can upload whatever I want. |
19:22
🔗
|
emijrp |
Cloud storage for free. HELL YEAH. |
19:22
🔗
|
emijrp |
Buy a good 100mbit internet connection and you have almost the same bus speed as local drives. |
19:24
🔗
|
emijrp |
You can do it too. But, you have to credit me. It was my idea. |
19:24
🔗
|
emijrp |
Thanks. |
19:25
🔗
|
Schbirid |
he he he |
19:25
🔗
|
Schbirid |
IA = cloud |
19:25
🔗
|
lowtekk |
i'm sure glad my hard drives are faster than 100mbit.... |
19:28
🔗
|
ersi |
Uh, I get around 300-550mbit to my drives |
19:28
🔗
|
ersi |
even more in my workstation |
19:28
🔗
|
lowtekk |
that's the idea :) |
19:30
🔗
|
lowtekk |
maybe it's ye olde ultra-dma drives he's talking about? |
19:31
🔗
|
Aranje |
my god, I'm looking at that infographic he posted and godaddy is still just as cluttered and confusing as it was in 1996 |
19:32
🔗
|
chronomex |
are you surprised? |
19:32
🔗
|
Aranje |
Not really >_> |
19:32
🔗
|
Aranje |
A little, I guess |
19:32
🔗
|
Aranje |
but I shouldn't be |
19:32
🔗
|
Aranje |
Then again, I've only had internet since 2005, so I wouldn't know what it looked like then |
19:32
🔗
|
ersi |
It's more surprising how those tards are still in business |
19:48
🔗
|
Schbirid |
emijrp: error on image retrieval or normal output http://pastebin.com/k8UxDwgD ? |
19:49
🔗
|
emijrp |
looks like it fails with images at wikia |
19:49
🔗
|
emijrp |
file a bug http://code.google.com/p/wikiteam/issues/list |
19:50
🔗
|
Schbirid |
it requires me to use a google account |
19:55
🔗
|
emijrp |
yes, blame the spammers |
19:56
🔗
|
emijrp |
im fixing the bug, wait |
19:57
🔗
|
Schbirid |
awesome! |
20:00
🔗
|
emijrp |
done |
20:00
🔗
|
emijrp |
do svn up |
20:01
🔗
|
emijrp |
and resume |
20:01
🔗
|
emijrp |
python dumpgenerator.py --api=... --xml --images --resume --path=pathtodirectory |
20:02
🔗
|
emijrp |
remove quakewikiacom-20110908-images.txt first, to be sure |
20:02
🔗
|
Schbirid |
no such file |
20:02
🔗
|
emijrp |
ok |
20:02
🔗
|
Schbirid |
works |
20:02
🔗
|
Schbirid |
nice |
20:02
🔗
|
Schbirid |
thanks! |
20:03
🔗
|
emijrp |
: ) |
20:18
🔗
|
Aranje |
You know what site I'd actually like to have archived? project gutenberg. |
20:18
🔗
|
Aranje |
I should figure that out. |
20:21
🔗
|
db48x |
Aranje: download the dvd image |
20:21
🔗
|
Schbirid |
iirc that is simple and nice :) |
20:21
🔗
|
Aranje |
oh is there one? |
20:21
🔗
|
Aranje |
sweet! |
20:22
🔗
|
Schbirid |
it was his goal to spread it easily |
20:22
🔗
|
Aranje |
awesome :D |
20:22
🔗
|
Aranje |
It's totally something I should have a copy of |
20:22
🔗
|
Schbirid |
you might also be interested in http://gen.lib.rus.ec/ |
20:23
🔗
|
Schbirid |
quite illegal though |
20:25
🔗
|
Aranje |
the lines of legal and illegal blur often for me :) |
20:26
🔗
|
Aranje |
and of course there is a torrent |
20:26
🔗
|
Aranje |
lmao |
20:38
🔗
|
DFJustin |
the dvd image isn't a complete set |
20:38
🔗
|
DFJustin |
but the gutenberg etexts are already mirrored on dozens of mirrors and on archive.org |
20:39
🔗
|
Aranje |
oh, cool |
20:39
🔗
|
DFJustin |
http://www.archive.org/details/gutenberg |
20:40
🔗
|
DFJustin |
http://www.gutenberg.org/catalog/world/mirror-redirect |
20:41
🔗
|
Aranje |
neat :D |