[00:17] wepp@Hildr ~/Archiveteam/sdb $ du -sh mobileme-grab04/data/t/ta/tak/take_junichiro/
[00:17] 32G mobileme-grab04/data/t/ta/tak/take_junichiro/
[00:17] Hmmmmm....
[00:18] Well, that explains why it's been going for two and a half days now.
[00:23] lol
[00:54] Wyatt|Wor: When that happens, do you check the wget log for problems?
[00:56] Pronoiac: I haven't yet. I only just noticed that it was happening an hour ago.
[00:58] Hm, time to figure out why wget refused to follow links within these pages
[00:59] Despite all being under the same structure
[01:01] Pronoiac: Good call. This is in loop-hell
[01:03] Looking for even this one file, http://dpaste.com/739562/
[01:04] I cut it off at 50, but it shows in the log 486 times...
[01:05] Should I just remove that whole directory?
[01:08] Uh, I'm not an authority on this, but I just ctrl-C'ed it & restarted seesaw.
[01:09] I'd leave the directory in place, unless you need the space.
[01:09] It might be useful for diagnosis later.
[01:09] I figure another pass will be useful for weird items like this.
[01:15] I have some technical thoughts on the recursive loops. Should I share?
[01:22] Pronoiac: I've got no objections. I'll note that alard is AFAIK the maintainer for universal-tracker and the downloader clients.
[01:23] Should I braindump here, or contact alard?
[01:24] I think he'll see it whenever he sees it if you put it here. Also opens it up for commentary from others.
[01:37] Okay, so here's one problem: some items form a recursive loop with multiple slashes - / -> // -> /// etc.
[01:37] I *think* the norm to avoid recursive loops in wget uses timestamping (-N) to avoid re-fetching items.
[01:37] This option's disabled by default, to avoid warc searching or something, even when we're saving files into the usual tree.
[01:37] I edited a local wget-warc to enable that, but it didn't work -
[01:37] after fetching a // file, it would parse it before checking for an already-existing file. (This might have been due to my faulty coding.)
[01:37] So, I see a couple of ways forward:
[01:37] Option 1. Fix timestamping in wget-warc, & check for an existing file before trying to fetch or parse.
[01:38] Option 2. Instead of doing an automatic recursive fetch, build a list of files somehow, and pass that list to wget.
[01:38] As a bonus, option 2 would avoid a problem I've seen elsewhere, where hundreds of references to 404'ed files result in hundreds of requests to those files, with wget using gigabytes of memory and getting oom-killed.
[01:39] Cool?
[01:39] So, I got my driver's license yesterday
[01:39] FIRST TIME ALONE?
[01:39] Going to Burger King
[01:39] hell yes
[01:39] hahahahah
[02:12] See, the annoying thing here is that wget can't suppress multiple consecutive slashes, or it leads to unexpected behaviour.
[02:13] According to this thread, it used to but no longer does: http://marc.info/?l=wget&m=108972466200930&w=2
[02:13] ... that's a problem :[
[02:17] Also relevant: http://osdir.com/ml/bug-wget-gnu/2010-06/msg00012.html
[02:20] Seems the proposal for a URI slash-normalising option was never acked.
[02:27] Now that I look... this looks like it was just spidering over a bunch of other users' homepage.mac.com
[02:29] Yeah, totally.
[02:29] I just downloaded about 340 people's homepage sites. :/
[02:30] is that bad?
[02:30] or have those already been downloaded?
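[Editor's note: the "Option 2" idea above - feeding wget a flat, deduplicated URL list instead of letting it recurse - could look roughly like the sketch below. This is only a hedged illustration, not the actual seesaw/wget-warc pipeline; the file names (index.html, urls.txt), the seed URL, and the slash-collapsing sed expression are assumptions, and relative links would still need resolving against the base URL before this was usable. The real grab would also write a WARC via wget-warc's --warc-file option.]

    # Fetch the item's front page once (example seed URL):
    wget -q -O index.html 'http://homepage.mac.com/take_junichiro/'
    # Pull out hrefs, collapse runs of slashes (the / -> // -> /// loop)
    # without touching the "http://" prefix, and deduplicate:
    grep -o 'href="[^"]*"' index.html \
      | sed -e 's/^href="//' -e 's/"$//' \
      | sed -e 's#\([^:]\)/\{2,\}#\1/#g' \
      | sort -u > urls.txt
    # -N (timestamping) skips files already on disk; -i reads the flat list,
    # so wget never spiders beyond what we explicitly asked for.
    wget -N -i urls.txt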
[02:32] balrog_: I have no way of knowing that, but it's probably somewhat suboptimal that attempting to mirror homepage.mac.com/take_junichiro/ started me off on spidering what could have been all of homepage.mac.com
[02:32] this is true.
[02:35] That sounds familiar - let me see if I can help narrow down the problem.
[02:37] I had swampyhatto spread out over around 150 others.
[02:51] Okay, I've done a bit of grepping - I think the problem came from weirdly formed links - like
[02:52] That's what I thought.
[02:53] If you want to do similar diagnosis, go into the data directory that it's fetching into, "ls -alt | tail", and recursively grep for some of the oldest stuff in the right folder.
[02:54] I think it's meant for stylesheets & whatnot, so wget thinks they're necessary for proper display.
[05:20] Ops, please.
[05:58] cannot give ops, sry
[05:59] someone op SketchCo1 plz
[07:52] greetings dan
[07:52] how's Michigan tonight?
[07:53] hello
[07:53] cold.
[07:53] almost May, should be warming up!
[07:53] would be nice.
[15:16] OK.
[15:17] 1. Batcave had the vast majority of material fly over. Almost done with it. So, alard - please start moving people away from batcave.
[15:18] 2. A new batcave will replace it in the future, but I didn't think it fair to be chop-chop when it took me two months to get batcave clean.
[15:19] 3. The temporary holdoff worked - I've gotten us caught up on the mobileme uploads. 40% disk utilization and dropping on that disk.
[15:19] oli: is this you? http://flask.pocoo.org/snippets/57/
[15:48] kennethre: hm? microtransactions? ;o
[15:58] /whois archinet
[16:45] SketchCo1: I can't fix this error: http://archive.org/details/this_week_in_fun
[16:45] SketchCo1: this one also has the same problem: http://archive.org/details/abbys_road
[16:46] Yeah
[16:46] Yes.
[16:46] Yes, your XML makes the system crash.
[16:46] it's because some meta tags in the mp3s are weird
[16:46] Yes.
[16:47] I'll mention it to the devs.
[16:47] thanks
[16:48] I emailed in their error report on Friday and no reply yet
[16:49] I think I stopped this from happening with roz_rows_the_pacific, since I had to tell archive to look at .nfo as text files
[16:50] I just noticed the titles of the mp3s were off
[16:51] SketchCo1: this new podcast can be added to twit-podcasts: http://archive.org/details/the_laporte_report
[17:00] I'm now uploading Jumping Monkeys
[17:17] oh jamendo, http://blog.jamendo.com/2012/04/26/jamendo-has-a-new-look/
[19:46] SketchCo1: Thanks. I have switched the memac upload target. The current uploads will finish via batcave, new ones will go directly to s3.
[20:00] Jumping Monkeys is uploaded finally: http://archive.org/details/jumping_monkeys
[21:12] i found Game Shark magazine
[22:49] http://twitter.com/#!/h0mfr0g/status/28509114379
[22:50] OH SNAP
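[Editor's note: Pronoiac's 02:53 diagnosis recipe, made concrete. This is a hedged sketch only; the item directory layout, the host subdirectory, the log name (wget.log), and the grepped filename are assumed examples, not the pipeline's actual paths.]

    # Go into the item's data directory (example path from the du output above):
    cd mobileme-grab04/data/t/ta/tak/take_junichiro/
    # The oldest files in the fetch tree are usually the ones that dragged
    # everything else in:
    ls -alt homepage.mac.com/take_junichiro/ | tail
    # Recursively grep for one of those old names to find the pages linking to it
    # (substitute whatever filename "ls -alt | tail" actually showed):
    grep -rl 'oldfile.css' homepage.mac.com/ | head
    # The most-repeated URLs in the wget log point at loops like the
    # "486 times" case mentioned at 01:04:
    grep -o 'http://[^ ]*' wget.log | sort | uniq -c | sort -rn | head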