[00:22] what is wget plus warc option?
[00:22] http://archiveteam.org/index.php?title=Wget_with_WARC_output
[00:23] hopefully it gets accepted into mainline wget with the latest patch.
[00:25] ok is this a good strategy? wget -mcpk woxy.com
[00:31] by default, wget -m will respect robots.txt
[00:32] which is often bad. (the lachlan cranswick site was actually decent... it only blocked the /reports/ directory, which contains generated website usage reports, which contain tarpit urls)
[00:32] Fuuuuuuuuuuuuuuuck robots.txt
[00:33] here's the robots.txt for the site: http://woxy.com/robots.txt
[00:34] also, k without K?
[00:34] and you might want to change the useragent
[00:34] i just checked robots; will not having these pages be an issue?
[00:35] so k and K *are* different?
[00:36] -k,  --convert-links      make links in downloaded HTML or CSS point to local files.
[00:36] -K,  --backup-converted   before converting file X, back up as X.orig.
[00:38] is there any way of knowing how big woxy is so i can allocate some space?
[00:39] this will probably be pretty big, with all the MP3s and stuff
[00:39] We can handle it.
[00:39] Do you have a slot on batcave?
[00:40] I did. I haven't used it since the most recent friendster block finished
[00:41] i wonder how bad my aws bill will be this month
[00:41] I'm using a free tier instance, but wound up adding a 100GB ebs volume
[00:42] anyway i'm going to stop the wget now since i probably don't have the space or upload bandwidth to get this anywhere useful inside of a month, so here's the folder you probably want: http://woxy.com/media/audio/
[00:42] i want /
[00:50] there is a whole bunch of stuff you would have missed in the blog, such as interviews (with mp3s of the interviews)
[00:54] ugh
[00:55] this band seems alright... but in this recording, something sounds wrong with the bass amp, like the surrounds on the speaker are torn or something
[00:57] Hate it when that happens
[09:05] woxy.com pull complete. total warc size is about 16GB
[09:05] ...
[09:06] or not
[09:07] oh?
[09:07] it got oom-killed
[09:09] bad sign
[09:09] well, it is an ec2 micro instance :(
[09:11] ah
[09:11] my wget is using 280mb
[09:12] of which 104mb is resident
[09:12] so I think I'll be ok
[09:12] hrm
[09:12] I've only managed to download 240 megs
[09:12] grr
[09:13] my kernel config file says zram was built as a module, but I can't find it in /lib/modules
[09:14] how many files do you have in that one directory now?
[09:14] in the non-warc directory tree?
[09:15] yea
[09:15] in your boards directory
[09:15] alard: is there a way to write to the warc file without writing a plain output file as well?
[09:16] find woxy.com/boards -type f | wc -l
[09:16] 35899
[09:18] Coderoe: From a warc perspective, yes, try -O /dev/null. From a wget perspective, no, it seems that -O /dev/null breaks --recursive.
[09:18] Coderjoe, that is.
[09:19] well that sucks
[09:19] In my mobileme script I do a rm -rf afterwards, but it's not ideal, no.
[09:21] You might try --delete-after
[09:23] (--delete-after doesn't remove the directories, though.)
[09:25] the directories are fine
[09:26] -O tempfile also works.
[09:26] As long as it's downloaded somewhere where the html/css parser can read it, I think.
[09:28] too bad any assets loaded by javascript don't get pulled down
[09:29] Actually, do NOT use -O tempfile, it messes up the --page-requisites.
[09:29] I think with -O it keeps appending
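[editor's note: the thread above converges on a recursive wget run that ignores robots.txt, spoofs the user agent, keeps no local file tree, and records everything into a WARC. A minimal sketch under those assumptions, for a WARC-capable wget (the Archive Team patch discussed at [00:22], later merged into mainline wget 1.14); the user agent string here is illustrative:

    # mirror with page requisites, ignore robots.txt, delete local copies
    # as they finish, and record all requests/responses into a gzipped WARC
    wget --mirror --page-requisites \
         -e robots=off \
         -U "Mozilla/5.0 (compatible; example-archiver)" \
         --delete-after --no-directories \
         --warc-file=woxy.com \
         http://woxy.com/

-k/--convert-links is left out on purpose: with --delete-after there are no local files left to rewrite, and the WARC keeps the responses byte-for-byte as served anyway.]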
[09:29] Why not use Heritrix?
[09:32] ugh
[09:33] It doesn't have those OOM problems and it can get things loaded by javascript, but it takes a lot of xml configuration to run.
[09:33] it's java
[09:34] which is not terribly friendly for an ec2 micro instance
[09:36] nor is your wget usage, apparently :)
[09:36] yeah... I wonder what is eating all the rams
[09:36] I'm not doing -k, so it doesn't need to keep track of urls to rewrite
[09:37] is it keeping a list of visited urls or something :-\
[09:38] well, it does have to avoid duplicates
[09:38] The --recursive is very memory intensive.
[09:39] well, i just threw a 4G swap at it :-\
[09:40] Ah, the wget manual is full of nice surprises: you can combine --delete-after with --no-directories. Then you won't get files *and* you won't get the directories.
[09:40] but --delete-after will log that it deleted the file
[09:41] which I suppose doesn't matter if it is still in the warc, but someone looking at the log file would wonder what was up
[09:41] btw, where are you writing the log file?
[09:42] It doesn't log the delete-after with -nv.
[09:42] atm, I am not using -nv
[09:42] The log file is added to the end of the warc file, if you have a single warc (with warc-max-size=inf, the default).
[09:43] I set the size to 1G
[09:43] If you have multiple warcs (e.g. warc-max-size=1G), you'll get a meta.warc.gz
[09:43] (I was already up to 16G)
[09:43] and where does it save it while the downloads are running?
[09:43] I don't see a file in any temp directory anywhere
[09:44] In a temporary file. The file is created, opened and then immediately unlinked. As far as I understand, this will keep the file for as long as the program needs it.
[09:44] it will
[09:44] as long as there is at least 1 open fd on it, it will still be around
[09:44] There's the temporary log file and a temporary file each time wget downloads a file.
[09:45] Coderjoe: I've had wget eat 12GB RAM
[09:45] You can set --warc-tempdir to change the location of the temporary files.
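[editor's note: on the "invisible" temp files at [09:43]-[09:44]: wget creates each WARC scratch file, opens it, and immediately unlinks it, so it consumes disk space without ever showing up in a directory listing. A minimal shell demonstration of that POSIX behavior (the /proc inspection assumes Linux; the file name is whatever mktemp returns):

    tmp=$(mktemp)               # create a scratch file
    exec 3<>"$tmp"              # hold it open on file descriptor 3
    rm "$tmp"                   # unlink it: the directory entry is gone...
    echo "still writable" >&3   # ...but I/O through fd 3 keeps working
    ls -l /proc/$$/fd/3         # Linux shows "3 -> /tmp/tmp.XXXX (deleted)"
    exec 3>&-                   # closing the last fd is what frees the space

--warc-tempdir, mentioned at [09:45], points those scratch files at a filesystem with enough headroom, which matters on small instances.]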
[09:56] ....
[09:56] http://htop.sourceforge.net/128.png
[09:56] someone give me that machine plz
[10:31] whoa
[16:31] awesome
[19:59] Coderjoe: Damn, that's nice
[19:59] haha
[19:59] that's what i thought
[20:00] am i reading it wrong or is that 128 cores with 88gb RAM?
[20:10] wholy shiv
[20:22] winr4r: 880GB ram
[20:23] yeah, that, heh
[20:23] :D
[20:23] how is everyone tonight? :)
[20:27] okay i guess, taking it easy, not looking forward to work tomorrow
[20:27] you?
[20:27] pretty awesome here!
[20:28] nice~
[20:28] i'm off work till next monday
[20:28] ah :]
[20:30] my boss got back to me after two weeks of ignoring my emails, i think it was signing my last one with "P.S. Answer your fucking emails" that did it
[20:30] times is hard!
[20:33] lol
[20:35] random topic, but archiving-relevant: where is the steve jobs life torrent, books, audio, docs, everything? it used to be that when a semi-famous person died, their papers and junk were collected and usually stored in a library or something. where's steve's stuff? i'd like to see it and i hate having to hunt all over creation to find it.
[20:35] steve is in no danger of disappearing
[20:35] chronomex: neither was geocities in 2001
[20:36] then where's his life in data? all in one place would be great
[20:36] that's a hell of a comparison
[20:36] there needs to be one, as much as i have mixed feelings about steve jobs
[20:36] ditto, and me, too, respectively
[20:36] I'm going to recuse myself from this before I get angry
[20:36] (by which i mean he'll be remembered for doing evil things more than he will
[20:36] will be remembered for doing good things*
[20:36] chronomex: hmm?
[20:37] sorry for annoying you, but what?
[20:40] * winr4r hugs chronomex.
[21:21] * closure waves to SketchCow @ Facebook from over the way @ Google
[21:31] oww
[21:31] free -m
[21:31] Mem:   595   589     5     0     0    14
[21:31] Swap: 4095   852  3243
[21:32] <3 you wget
[22:07] :D
[22:08] how goes the woxy project?
[22:21] legendary packaging, eh?
[22:21] I'm all curious now
[22:22] * db48x hmmms
[22:22] * winr4r prods dnova and db48x
[22:22] careful with that thing
[22:23] winr4r: yo
[22:53] is there any way around this? User-agent: * Disallow:
[22:57] in wget?
[22:57] sure
[22:57] -e robots=off or whatever
[23:14] how goes the woxy? wget is currently at 1752M virtual, and still going. (the instance only has 600M or so, which means it is 1320M into swap)
[23:24] what's your wget commandline?
[23:30] insanity
[23:30] wget -nv --delete-after --no-directories -U Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) -m -e robots=no -p --warc-file=woxy.com --warc-max-size=1G --warc-header=operator: Thad Ward for Archive Team --warc-cdx --warc-tempdir=warctmp http://woxy.com/
[23:31] 129670 woxy.com.cdx
[23:46] wow
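[editor's note: the command line pasted at [23:30] would not survive shell word-splitting as shown: the user agent and the --warc-header value contain spaces and parentheses, so they need quoting. A hedged reconstruction of what was presumably run:

    wget -nv --delete-after --no-directories \
         -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
         -m -e robots=no -p \
         --warc-file=woxy.com --warc-max-size=1G \
         --warc-header="operator: Thad Ward for Archive Team" \
         --warc-cdx --warc-tempdir=warctmp \
         http://woxy.com/

(wget's boolean option parser accepts "no" as a synonym for "off", so -e robots=no behaves like the -e robots=off suggested at [22:57].)]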