Time |
Nickname |
Message |
00:22
🔗
|
bsmith093 |
what is wget plus warc option? |
00:22
🔗
|
Coderjoe |
http://archiveteam.org/index.php?title=Wget_with_WARC_output |
00:23
🔗
|
Coderjoe |
hopefully it gets accepted into mainline wget with the latest patch. |
00:25
🔗
|
bsmith093 |
ok is this a good strategy? wget -mcpk woxy.com |
00:31
🔗
|
Coderjoe |
by default, wget -m will respect robots.txt |
00:32
🔗
|
Coderjoe |
which is often bad. (the lachlan cranswick site was actually decent... it only blocked the /reports/ directory, which contains generated website usage reports, which contain tarpit urls) |
00:32
🔗
|
SketchCow |
Fuuuuuuuuuuuuuuuck robots.txt |
00:33
🔗
|
dashcloud |
here's the robots.txt for the site: http://woxy.com/robots.txt |
00:34
🔗
|
Coderjoe |
also, k without K? |
00:34
🔗
|
Coderjoe |
and you might want to change the useragent |
00:34
🔗
|
bsmith093 |
i just checked robots.txt; will not having these pages be an issue? |
00:35
🔗
|
bsmith093 |
so k and K *are* different? |
00:36
🔗
|
Coderjoe |
-k, --convert-links make links in downloaded HTML or CSS point to |
00:36
🔗
|
Coderjoe |
-K, --backup-converted before converting file X, back up as X.orig. |
00:36
🔗
|
Coderjoe |
local files. |
00:38
🔗
|
bsmith093 |
is there any way of knowing how big woxy is so i can allocate some space? |
00:39
🔗
|
Coderjoe |
this will probably be pretty big, with all the MP3s and stuff |
00:39
🔗
|
SketchCow |
We can handle it. |
00:39
🔗
|
SketchCow |
Do you have a slot on batcave? |
00:40
🔗
|
Coderjoe |
I did. I haven't used it since the most recent friendster block finished |
00:41
🔗
|
Coderjoe |
i wonder how bad my aws bill will be this month |
00:41
🔗
|
Coderjoe |
I'm using a free tier instance, but wound up adding a 100GB ebs volume |
00:42
🔗
|
bsmith093 |
anyway im going to stop the wget now since i probably dont have the space or upload bandwidth to get this anywhere useful, inside of a month, so heres the folder u probably want http://woxy.com/media/audio/ |
00:42
🔗
|
Coderjoe |
i want / |
00:50
🔗
|
Coderjoe |
there is a whole bunch of stuff you would have missed in the blog, such as interviews (with mp3s of the interviews) |
00:54
🔗
|
Coderjoe |
ugh |
00:55
🔗
|
Coderjoe |
this band seems alright... but in this recording, something sounds wrong with the bass amp, like the surrounds on the speaker are torn or something |
00:57
🔗
|
underscor |
Hate it when that happens |
09:05
🔗
|
Coderjoe |
woxy.com pull complete. total warc size is about 16GB |
09:05
🔗
|
Coderjoe |
... |
09:06
🔗
|
Coderjoe |
or not |
09:07
🔗
|
db48x2 |
oh? |
09:07
🔗
|
Coderjoe |
it got oom-killed |
09:09
🔗
|
db48x2 |
bad sign |
09:09
🔗
|
Coderjoe |
well, it is a ec2 micro instance :( |
09:11
🔗
|
db48x2 |
ah |
09:11
🔗
|
db48x2 |
my wget is using 280mb |
09:12
🔗
|
db48x2 |
of which 104mb is resident |
09:12
🔗
|
db48x2 |
so I think I'll be ok |
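The two figures quoted here (total virtual size and the resident portion) can be read from /proc on Linux. A minimal sketch, assuming a Linux /proc filesystem; the function name `mem_usage` is hypothetical, and it is pointed at the current shell only so that it runs on its own (substitute the PID of the running wget, for example from `pgrep wget`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print virtual (VmSize) and resident (VmRSS) memory
# of a process in MB, read from /proc/<pid>/status (Linux only).
mem_usage() {
  local pid=$1
  # The kernel reports both values in kB; convert to whole MB.
  awk '/^VmSize:|^VmRSS:/ { printf "%s %d MB\n", $1, $2 / 1024 }' "/proc/$pid/status"
}

# Demonstrated on the current shell; replace $$ with the wget PID.
mem_usage $$
```

A large gap between the two numbers means much of the process is not resident in RAM (swapped out or never touched).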
09:12
🔗
|
db48x2 |
hrm |
09:12
🔗
|
db48x2 |
I've only managed to download 240 megs |
09:12
🔗
|
Coderjoe |
grr |
09:13
🔗
|
Coderjoe |
my kernel config file says zram was built as a module, but I can't find it in /lib/modules |
09:14
🔗
|
db48x2 |
how many files do you have in that one directory now? |
09:14
🔗
|
Coderjoe |
in the non-warc directory tree? |
09:15
🔗
|
db48x2 |
yea |
09:15
🔗
|
db48x2 |
in your boards directory |
09:15
🔗
|
Coderjoe |
alard: is there a way to write to the warc file without writing a plain output file as well? |
09:16
🔗
|
Coderjoe |
find woxy.com/boards -type f | wc -l |
09:16
🔗
|
Coderjoe |
35899 |
09:18
🔗
|
alard |
Coderoe: From a warc perspective, yes, try -O /dev/null. From a wget perspective, no, it seems that -O /dev/null breaks --recursive. |
09:18
🔗
|
alard |
Coderjoe, that is. |
09:19
🔗
|
Coderjoe |
well that sucks |
09:19
🔗
|
alard |
In my mobileme script I do a rm -rf afterwards, but it's not ideal, no. |
09:21
🔗
|
alard |
You might try --delete-after |
09:23
🔗
|
alard |
(--delete-after doesn't remove the directories, though.) |
09:25
🔗
|
Coderjoe |
the directories are fine |
09:26
🔗
|
alard |
-O tempfile also works. |
09:26
🔗
|
alard |
As long as it's downloaded somewhere where the html/css parser can read it, I think. |
09:28
🔗
|
Coderjoe |
too bad any assets loaded by javascript don't get pulled down |
09:29
🔗
|
alard |
Actually, do NOT use -O tempfile, it messes up the --page-requisites. |
09:29
🔗
|
Coderjoe |
I think with -O it keeps appending |
09:29
🔗
|
alard |
Why not use Heritrix? |
09:32
🔗
|
Coderjoe |
ugh |
09:33
🔗
|
alard |
It doesn't have those OOM problems, it can get things loaded by javascript, it takes a lot of xml configuration to run. |
09:33
🔗
|
Coderjoe |
it's java |
09:34
🔗
|
Coderjoe |
which is not terribly friendly for a ec2 micro instance |
09:36
🔗
|
ersi |
nor is your wget usage apparently :) |
09:36
🔗
|
Coderjoe |
yeah... I wonder what is eating all the rams |
09:36
🔗
|
Coderjoe |
I'm not doing -k, so it doesn't need to keep track of urls to rewrite |
09:37
🔗
|
Coderjoe |
is it keeping a list of visited urls or something :-\ |
09:38
🔗
|
db48x2 |
well, it does have to avoid duplicates |
09:38
🔗
|
alard |
The --recursive is very memory intensive. |
09:39
🔗
|
Coderjoe |
well, i just threw a 4G swap at it :-\ |
09:40
🔗
|
alard |
Ah, the wget manual is full of nice surprises: you can combine --delete-after with --no-directories. Then you won't get files *and* you won't get the directories. |
09:40
🔗
|
Coderjoe |
but --delete-after will log that it deleted the file |
09:41
🔗
|
Coderjoe |
which I suppose doesn't matter if it is still in the warc, but someone looking at the log file would wonder what was up |
09:41
🔗
|
Coderjoe |
btw, where are you writing the log file? |
09:42
🔗
|
alard |
It doesn't log the delete-after with -nv. |
09:42
🔗
|
Coderjoe |
atm, I am not using -nv |
09:42
🔗
|
alard |
The log file is added to the end of the warc file, if you have a single warc (with warc-max-size=inf, the default). |
09:43
🔗
|
Coderjoe |
I set the size to 1G |
09:43
🔗
|
alard |
If you have multiple warcs (e.g. warc-max-size=1G), you'll get a meta.warc.gz |
09:43
🔗
|
Coderjoe |
(I was already up to 16G) |
09:43
🔗
|
Coderjoe |
and where does it save it while the downloads are running? |
09:43
🔗
|
Coderjoe |
I don't see a file in any temp directory anywhere |
09:44
🔗
|
alard |
In a temporary file. The file is created, opened and then immediately unlinked. As far as I understand, this will keep the file for as long as the program needs it. |
09:44
🔗
|
Coderjoe |
it will |
09:44
🔗
|
Coderjoe |
as long as there is at least 1 open fd on it, it will still be around |
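The create, open, unlink pattern described here can be tried from a shell. A minimal sketch, assuming bash on Linux (the read-back through /proc is Linux-specific); the file and descriptor number are arbitrary choices for illustration and have nothing to do with wget's own temporary files.

```shell
#!/usr/bin/env bash
tmp=$(mktemp)                    # create a scratch file
exec 3<>"$tmp"                   # open it read/write on file descriptor 3
rm -- "$tmp"                     # unlink it: the name is gone, the data is not
echo "still here" >&3            # writing through the open fd still works

# Linux-specific: reopen the deleted file through /proc to read it back.
content=$(cat "/proc/$$/fd/3")
echo "$content"

exec 3>&-                        # closing the last fd lets the kernel free the space
```

This is why no file shows up in any temp directory while the download runs: the directory entry is removed right away, and only the open descriptor keeps the data around.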
09:44
🔗
|
alard |
There's the temporary log file and a temporary file each time wget downloads a file. |
09:45
🔗
|
ersi |
Coderjoe: I've had wget eat 12GB RAM |
09:45
🔗
|
alard |
You can set --warc-tempdir to change the location of the temporary files. |
09:56
🔗
|
Coderjoe |
.... |
09:56
🔗
|
Coderjoe |
http://htop.sourceforge.net/128.png |
09:56
🔗
|
Coderjoe |
someone give me that machine plz |
10:31
🔗
|
db48x2 |
whoa |
16:31
🔗
|
sp0rus |
awesome |
19:59
🔗
|
underscor |
Coderjoe: Damn, that's nice |
19:59
🔗
|
underscor |
haha |
19:59
🔗
|
winr4r |
that's what i thought |
20:00
🔗
|
winr4r |
am i reading it wrong or is that 128 cores with 88gb RAM? |
20:10
🔗
|
chronomex |
wholy shiv |
20:22
🔗
|
underscor |
winr4r: 880GB ram |
20:23
🔗
|
winr4r |
yeah, that, heh |
20:23
🔗
|
underscor |
:D |
20:23
🔗
|
winr4r |
how is everyone tonight? :) |
20:27
🔗
|
Frigolit |
okay i guess, taking it easy, not looking forward to work tomorrow |
20:27
🔗
|
Frigolit |
you? |
20:27
🔗
|
winr4r |
pretty awesome here! |
20:28
🔗
|
Frigolit |
nice~ |
20:28
🔗
|
winr4r |
i'm off work till next monday |
20:28
🔗
|
Frigolit |
ah :] |
20:30
🔗
|
winr4r |
my boss got back to me after two weeks of ignoring my emails, i think it was signing my last one with "P.S. Answer your fucking emails" that did it |
20:30
🔗
|
winr4r |
times is hard! |
20:33
🔗
|
underscor |
lol |
20:35
🔗
|
bsmith093 |
random topic, but archiving relevant: where is the steve jobs life torrent, books, audio, docs, everything? it used to be when a semi famous person died, their papers and junk were collected, and usually stored in a library or something. where's steve's stuff? id like to see it and i hate having to hunt all over creation to find it. |
20:35
🔗
|
chronomex |
steve is in no danger of disappearing |
20:35
🔗
|
winr4r |
chronomex: neither was geocities in 2001 |
20:36
🔗
|
bsmith093 |
then where's his life in data? all in one place would be great |
20:36
🔗
|
chronomex |
that's a hell of a comparison |
20:36
🔗
|
winr4r |
there needs to be one, as much as i have mixed feelings about steve jobs |
20:36
🔗
|
bsmith093 |
ditto, and me, too, respectively |
20:36
🔗
|
chronomex |
I'm going to recuse myself from this before I get angry |
20:36
🔗
|
winr4r |
(by which i mean he'll be remembered for doing evil things more than he will |
20:36
🔗
|
winr4r |
will be remembered for doing good things* |
20:36
🔗
|
winr4r |
chronomex: hmm? |
20:37
🔗
|
winr4r |
sorry for annoying you, but what? |
20:40
🔗
|
* |
winr4r hugs chronomex. |
21:21
🔗
|
* |
closure waves to SketchCow @ Facebook from over the way @ Google |
21:31
🔗
|
Coderjoe |
oww |
21:31
🔗
|
Coderjoe |
Mem: 595 589 5 0 0 14 |
21:31
🔗
|
Coderjoe |
Swap: 4095 852 3243 |
21:31
🔗
|
Coderjoe |
free -m |
21:32
🔗
|
Coderjoe |
<3 you wget |
22:07
🔗
|
winr4r |
:D |
22:08
🔗
|
dashcloud |
how goes the woxy project? |
22:21
🔗
|
dnova |
legendary packaging, eh? |
22:21
🔗
|
dnova |
I'm all curious now |
22:22
🔗
|
* |
db48x hmmms |
22:22
🔗
|
* |
winr4r prods dnova and db48x |
22:22
🔗
|
dnova |
careful with that thing |
22:23
🔗
|
db48x |
winr4r: yo |
22:53
🔗
|
bsmith093 |
is there any way around this User-agent: * Disallow: |
22:57
🔗
|
db48x |
in wget? |
22:57
🔗
|
db48x |
sure |
22:57
🔗
|
db48x |
-e robots=off or whatever |
23:14
🔗
|
Coderjoe |
how goes the woxy? wget is currently at 1752M virtual, and still going. (the instance only has 600M or so, which means it is 1320M into swap) |
23:24
🔗
|
dashcloud |
what's your wget commandline? |
23:30
🔗
|
Coderjoe |
insanity |
23:30
🔗
|
Coderjoe |
wget -nv --delete-after --no-directories -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" -m -e robots=no -p --warc-file=woxy.com --warc-max-size=1G --warc-header="operator: Thad Ward for Archive Team" --warc-cdx --warc-tempdir=warctmp http://woxy.com/ |
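The user-agent and --warc-header values contain spaces, parentheses and semicolons, so they have to be quoted when the command is typed into a shell. A minimal sketch of the same invocation as a bash array, with each option annotated from the discussion earlier in this log; the array name `wget_args` is hypothetical, and the script only prints the command rather than running it.

```shell
#!/usr/bin/env bash
wget_args=(
  -nv                     # non-verbose: also keeps the --delete-after notices out of the log
  --delete-after          # remove each plain file once it has been processed
  --no-directories        # combined with --delete-after, leaves no directory tree behind
  -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  -m                      # mirror (recursive)
  -e robots=no            # do not honour robots.txt
  -p                      # page requisites
  --warc-file=woxy.com    # base name for the WARC output
  --warc-max-size=1G      # split into 1G WARCs; the log then goes to a meta.warc.gz
  --warc-header="operator: Thad Ward for Archive Team"
  --warc-cdx              # also write a CDX index (woxy.com.cdx)
  --warc-tempdir=warctmp  # where the unlinked temporary files are created
  http://woxy.com/
)

# Print the command with shell-safe quoting instead of running it.
printf '%q ' wget "${wget_args[@]}"; echo

# To actually run it (needs a wget build with WARC support):
# wget "${wget_args[@]}"
```

Keeping the options in an array means each quoted value stays a single argument no matter what characters it contains.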
23:31
🔗
|
Coderjoe |
129670 woxy.com.cdx |
23:46
🔗
|
dashcloud |
wow |