#archiveteam 2011-10-25,Tue

Time Nickname Message
00:22 🔗 bsmith093 what is wget plus warc option?
00:22 🔗 Coderjoe http://archiveteam.org/index.php?title=Wget_with_WARC_output
00:23 🔗 Coderjoe hopefully it gets accepted into mainline wget with the latest patch.
00:25 🔗 bsmith093 ok is this a good strategy? wget -mcpk woxy.com
00:31 🔗 Coderjoe by default, wget -m will respect robots.txt
00:32 🔗 Coderjoe which is often bad. (the lachlan cranswick site was actually decent... it only blocked the /reports/ directory, which contains generated website usage reports, which contain tarpit urls)
00:32 🔗 SketchCow Fuuuuuuuuuuuuuuuck robots.txt
00:33 🔗 dashcloud here's the robots.txt for the site: http://woxy.com/robots.txt
00:34 🔗 Coderjoe also, k without K?
00:34 🔗 Coderjoe and you might want to change the useragent
00:34 🔗 bsmith093 i just checked robots.txt, will not having these pages be an issue?
00:35 🔗 bsmith093 so k and K *are* different?
00:36 🔗 Coderjoe -k, --convert-links make links in downloaded HTML or CSS point to
00:36 🔗 Coderjoe local files.
00:36 🔗 Coderjoe -K, --backup-converted before converting file X, back up as X.orig.
00:38 🔗 bsmith093 is there any way of knowing how big woxy is so i can allocate some space?
00:39 🔗 Coderjoe this will probably be pretty big, with all the MP3s and stuff
00:39 🔗 SketchCow We can handle it.
00:39 🔗 SketchCow Do you have a slot on batcave
00:40 🔗 Coderjoe I did. I haven't used it since the most recent friendster block finished
00:41 🔗 Coderjoe i wonder how bad my aws bill will be this month
00:41 🔗 Coderjoe I'm using a free tier instance, but wound up adding a 100GB ebs volume
00:42 🔗 bsmith093 anyway im going to stop the wget now since i probably dont have the space or upload bandwidth to get this anywhere useful, inside of a month, so heres the folder u probably want http://woxy.com/media/audio/
00:42 🔗 Coderjoe i want /
00:50 🔗 Coderjoe there is a whole bunch of stuff you would have missed in the blog, such as interviews (with mp3s of the interviews)
00:54 🔗 Coderjoe ugh
00:55 🔗 Coderjoe this band seems alright... but in this recording, something sounds wrong with the bass amp, like the surrounds on the speaker are torn or something
00:57 🔗 underscor Hate it when that happens
09:05 🔗 Coderjoe woxy.com pull complete. total warc size is about 16GB
09:05 🔗 Coderjoe ...
09:06 🔗 Coderjoe or not
09:07 🔗 db48x2 oh?
09:07 🔗 Coderjoe it got oom-killed
09:09 🔗 db48x2 bad sign
09:09 🔗 Coderjoe well, it is a ec2 micro instance :(
09:11 🔗 db48x2 ah
09:11 🔗 db48x2 my wget is using 280mb
09:12 🔗 db48x2 of which 104mb is resident
09:12 🔗 db48x2 so I think I'll be ok
09:12 🔗 db48x2 hrm
09:12 🔗 db48x2 I've only managed to download 240 megs
09:12 🔗 Coderjoe grr
09:13 🔗 Coderjoe my kernel config file says zram was built as a module, but I can't find it in /lib/modules
09:14 🔗 db48x2 how many files do you have in that one directory now?
09:14 🔗 Coderjoe in the non-warc directory tree?
09:15 🔗 db48x2 yea
09:15 🔗 db48x2 in your boards directory
09:15 🔗 Coderjoe alard: is there a way to write to the warc file without writing a plain output file as well?
09:16 🔗 Coderjoe find woxy.com/boards -type f | wc -l
09:16 🔗 Coderjoe 35899
09:18 🔗 alard Coderoe: From a warc perspective, yes, try -O /dev/null. From a wget perspective, no, it seems that -O /dev/null breaks --recursive.
09:18 🔗 alard Coderjoe, that is.
09:19 🔗 Coderjoe well that sucks
09:19 🔗 alard In my mobileme script I do a rm -rf afterwards, but it's not ideal, no.
09:21 🔗 alard You might try --delete-after
09:23 🔗 alard (--delete-after doesn't remove the directories, though.)
09:25 🔗 Coderjoe the directories are fine
09:26 🔗 alard -O tempfile also works.
09:26 🔗 alard As long as it's downloaded somewhere where the html/css parser can read it, I think.
09:28 🔗 Coderjoe too bad any assets loaded by javascript don't get pulled down
09:29 🔗 alard Actually, do NOT use -O tempfile, it messes up the --page-requisites.
09:29 🔗 Coderjoe I think with -O it keeps appending
09:29 🔗 alard Why not use Heritrix?
09:32 🔗 Coderjoe ugh
09:33 🔗 alard It doesn't have those OOM problems, it can get things loaded by javascript, it takes a lot of xml configuration to run.
09:33 🔗 Coderjoe it's java
09:34 🔗 Coderjoe which is not terribly friendly for a ec2 micro instance
09:36 🔗 ersi nor is your wget usage apparently :)
09:36 🔗 Coderjoe yeah... I wonder what is eating all the rams
09:36 🔗 Coderjoe I'm not doing -k, so it doesn't need to keep track of urls to rewrite
09:37 🔗 Coderjoe is it keeping a list of visited urls or something :-\
09:38 🔗 db48x2 well, it does have to avoid duplicates
09:38 🔗 alard The --recursive is very memory intensive.
09:39 🔗 Coderjoe well, i just threw a 4G swap at it :-\
09:40 🔗 alard Ah, the wget manual is full of nice surprises: you can combine --delete-after with --no-directories. Then you won't get files *and* you won't get the directories.
09:40 🔗 Coderjoe but --delete-after will log that it deleted the file
09:41 🔗 Coderjoe which I suppose doesn't matter if it is still in the warc, but someone looking at the log file would wonder what was up
09:41 🔗 Coderjoe btw, where are you writing the log file?
09:42 🔗 alard It doesn't log the delete-after with -nv.
09:42 🔗 Coderjoe atm, I am not using -nv
09:42 🔗 alard The log file is added to the end of the warc file, if you have a single warc (with warc-max-size=inf, the default).
09:43 🔗 Coderjoe I set the size to 1G
09:43 🔗 alard If you have multiple warcs (e.g. warc-max-size=1G), you'll get a meta.warc.gz
09:43 🔗 Coderjoe (I was already up to 16G)
09:43 🔗 Coderjoe and where does it save it while the downloads are running?
09:43 🔗 Coderjoe I don't see a file in any temp directory anywhere
09:44 🔗 alard In a temporary file. The file is created, opened and then immediately unlinked. As far as I understand, this will keep the file for as long as the program needs it.
09:44 🔗 Coderjoe it will
09:44 🔗 Coderjoe as long as there is at least 1 open fd on it, it will still be around
09:44 🔗 alard There's the temporary log file and a temporary file each time wget downloads a file.
09:45 🔗 ersi Coderjoe: I've had wget eat 12GB RAM
09:45 🔗 alard You can set --warc-tempdir to change the location of the temporary files.
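
The create, open, unlink pattern alard and Coderjoe describe can be tried in isolation. This is a minimal Python sketch of the general POSIX behaviour, not wget's actual code; the function name `unlinked_tempfile_roundtrip` is made up for illustration, and it assumes a POSIX system (on Windows, unlinking an open file normally fails).

```python
import os
import tempfile


def unlinked_tempfile_roundtrip(payload, directory=None):
    """Create a temp file, unlink it while it is still open, then use it.

    Returns (bytes read back, whether the path was still visible after unlink).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.unlink(path)                      # the name disappears from the directory
        still_listed = os.path.exists(path)  # nothing to find in the temp dir any more
        os.write(fd, payload)                # the open descriptor keeps working
        os.lseek(fd, 0, os.SEEK_SET)         # rewind before reading back
        data = os.read(fd, len(payload))
    finally:
        os.close(fd)                         # last descriptor closed: space is released
    return data, still_listed


data, still_listed = unlinked_tempfile_roundtrip(b"log line\n")
print(data, still_listed)
```

This is why no file shows up in any temp directory while the download runs: the data exists only behind the open descriptor until the process closes it or exits.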
09:56 🔗 Coderjoe ....
09:56 🔗 Coderjoe http://htop.sourceforge.net/128.png
09:56 🔗 Coderjoe someone give me that machine plz
10:31 🔗 db48x2 whoa
16:31 🔗 sp0rus awesome
19:59 🔗 underscor Coderjoe: Damn, that's nice
19:59 🔗 underscor haha
19:59 🔗 winr4r that's what i thought
20:00 🔗 winr4r am i reading it wrong or is that 128 cores with 88gb RAM?
20:10 🔗 chronomex wholy shiv
20:22 🔗 underscor winr4r: 880GB ram
20:23 🔗 winr4r yeah, that, heh
20:23 🔗 underscor :D
20:23 🔗 winr4r how is everyone tonight? :)
20:27 🔗 Frigolit okay i guess, taking it easy, not looking forward to work tomorrow
20:27 🔗 Frigolit you?
20:27 🔗 winr4r pretty awesome here!
20:28 🔗 Frigolit nice~
20:28 🔗 winr4r i'm off work till next monday
20:28 🔗 Frigolit ah :]
20:30 🔗 winr4r my boss got back to me after two weeks of ignoring my emails, i think it was signing my last one with "P.S. Answer your fucking emails" that did it
20:30 🔗 winr4r times is hard!
20:33 🔗 underscor lol
20:35 🔗 bsmith093 random topic, but archiving relevant: where is the steve jobs life torrent, books, audio, docs, everything? it used to be when a semi famous person died, their papers and junk were collected, and usually stored in a library or something. where's steve's stuff? id like to see it and i hate having to hunt all over creation to find it.
20:35 🔗 chronomex steve is in no danger of disappearing
20:35 🔗 winr4r chronomex: neither was geocities in 2001
20:36 🔗 bsmith093 then where's his life in data? all in one place would be great
20:36 🔗 chronomex that's a hell of a comparison
20:36 🔗 winr4r there needs to be one, as much as i have mixed feelings about steve jobs
20:36 🔗 bsmith093 ditto, and me, too, respectively
20:36 🔗 chronomex I'm going to recuse myself from this before I get angry
20:36 🔗 winr4r (by which i mean he'll be remembered for doing evil things more than he will
20:36 🔗 winr4r will be remembered for doing good things*
20:36 🔗 winr4r chronomex: hmm?
20:37 🔗 winr4r sorry for annoying you, but what?
20:40 🔗 * winr4r hugs chronomex.
21:21 🔗 * closure waves to SketchCow @ Facebook from over the way @ Google
21:31 🔗 Coderjoe oww
21:31 🔗 Coderjoe Mem: 595 589 5 0 0 14
21:31 🔗 Coderjoe Swap: 4095 852 3243
21:31 🔗 Coderjoe free -m
21:32 🔗 Coderjoe <3 you wget
22:07 🔗 winr4r :D
22:08 🔗 dashcloud how goes the woxy project?
22:21 🔗 dnova legendary packaging, eh?
22:21 🔗 dnova I'm all curious now
22:22 🔗 * db48x hmmms
22:22 🔗 * winr4r prods dnova and db48x
22:22 🔗 dnova careful with that thing
22:23 🔗 db48x winr4r: yo
22:53 🔗 bsmith093 is there any way around this User-agent: * Disallow:
22:57 🔗 db48x in wget?
22:57 🔗 db48x sure
22:57 🔗 db48x -e robots=off or whatever
23:14 🔗 Coderjoe how goes the woxy? wget is currently at 1752M virtual, and still going. (the instance only has 600M or so, which means it is 1320M into swap)
23:24 🔗 dashcloud what's your wget commandline?
23:30 🔗 Coderjoe insanity
23:30 🔗 Coderjoe wget -nv --delete-after --no-directories -U Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) -m -e robots=no -p --warc-file=woxy.com --warc-max-size=1G --warc-header=operator: Thad Ward for Archive Team --warc-cdx --warc-tempdir=warctmp http://woxy.com/
23:31 🔗 Coderjoe 129670 woxy.com.cdx
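
As pasted, the command line above has no quoting around the user-agent string or the WARC header, both of which contain spaces (and, for the user agent, parentheses and semicolons), so a shell would not accept it verbatim. The sketch below assumes the quoting was simply lost in the paste. It rebuilds the same argument list in Python and prints a shell-safe version. The flags and values are copied from the message and need a WARC-enabled wget build; the variable names are illustrative only.

```python
import shlex

# Values copied from the chat message; each must reach wget as ONE argument.
USER_AGENT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
WARC_HEADER = "operator: Thad Ward for Archive Team"

argv = [
    "wget", "-nv", "--delete-after", "--no-directories",
    "-U", USER_AGENT,
    "-m", "-e", "robots=no", "-p",
    "--warc-file=woxy.com",
    "--warc-max-size=1G",
    "--warc-header=" + WARC_HEADER,
    "--warc-cdx",
    "--warc-tempdir=warctmp",
    "http://woxy.com/",
]

# shlex.join quotes every argument that a POSIX shell would otherwise split.
command = shlex.join(argv)
print(command)
```

Passing `argv` straight to `subprocess.run(argv)` would avoid the shell, and the quoting question, entirely.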
23:46 🔗 dashcloud wow
