[00:03] Coderjoe: No, it doesn't. It does keep track of when you claimed it, so it's possible to release that later.
[00:04] chronomex: You can pull while running. If you stop, you can run dld-single.sh to resume that user.
[00:04] aye
[00:04] I told it to stop, will pull when it's done
[00:07] Stopping can take a while. I've got one user here that I killed yesterday, resumed this morning, and now it's still not finished.
[00:07] yeah, I just killed it and resumed
[00:07] * chronomex impatient
[00:07] * chronomex about to leave
[00:11] alard: Is this fire-and-forget now?
[00:12] looks like
[00:12] underscor: Yes, if there are no more bugs.
[00:12] yay
[00:14] I've been running it yesterday and today. Got 185 users done.
[00:14] (worth 94GB)
[00:14] Dang
[00:15] How many are there total?
[00:15] Also, is it safe to run concurrent clients?
[00:15] I've got a list of 300,000, I think.
[00:15] jesus shit
[00:15] damn
[00:16] I'm not sure if they're all this big. The list I've loaded now is from the wayback machine, the other list is from google.
[00:16] Concurrent clients is okay. Although it somehow won't get faster than 1MB/s for me, perhaps Akamai is limiting the speed.
[00:17] - Running wget --mirror (at least 2174 files)...
[00:17] ooh
[00:18] Heh. --- But I should really go to sleep now, I'll speak to you later.
[00:18] Adios
[00:23] I'm getting 20-80Mbps from akamai
[00:27] Already at 3G
[00:27] dang
[00:50] http://thechive.files.wordpress.com/2010/04/photo-subtitle-6.jpg
[01:02] I appear to be getting several seconds of 100-200kbit/s with an occasional burst up to 5-6Mbit/s
[02:30] demoscene irl: http://vimeo.com/31158841
[02:43] I'd really like to discuss the resulting files.
[03:58] SketchCow: Resulting foles from what?
[03:58] files*
[03:59] amerrykan: What are they doing?
[03:59] Are those birds?
[04:03] *tweet*
[05:08] giorgio moroder's restoration of fritz lang's metropolis is coming out on dvd and bluray on the 15th
[05:09] with or without soundtrack?
[05:10] think it would be awesome at double speed and with the TRON soundtrack
[05:10] :)
[05:11] with moroder's soundtrack
[09:15] mm
[09:15] no wonder this is going so slow
[09:15] wget keeps hammering the same url
[09:15] https://www.me.com/YouNeedToSpecifyTheAccountName.html:
[09:15] 2011-11-03 09:15:30 ERROR 404: Not Found.
[09:16] hah, oops :)
[09:16] with an occasional good url in there
[09:27] Coderjoe: https://www.me.com/YouNeedToSpecifyTheAccountName.html is me.com's way of saying Not Found.
[09:30] $ grep 'https://www.me.com/YouNeedToSpecifyTheAccountName.html' wget.log | wc -l
[09:30] 57183
[09:30] getting more at a rate of 2 per second
[09:31] and getting something else every 15 seconds or so
[09:32] Could you run gunzip -c data/h/hu/hub/hubertyamada/web.me.com/web.me.com-hubertyamada.warc.gz | grep "GET /g/" -A 24 | grep HTTP
[09:32] but then with the user you're currently getting?
[09:33] I expect you'll get a long list with only HTTP 302 responses.
[09:33] yser is timmyers
[09:34] er, user
[09:35] Okay, I'll try it in a moment.
[09:36] I think the problem is that some pages on web.me.com have links to images, stylesheets and scripts on web.me.com/g/. /g/ no longer exists and redirects to www.me.com/YouNeedetc
[09:36] 57923 302
[09:36] gunzip -c web.me.com-timmyers.warc.gz | grep "GET /g/" -A 24 | grep "^HTTP" | cut -d " " -f 2 | sort | uniq -c
[09:37] Nothing else? Then I think it's safe to start ignoring the /g/ directory. I'm trying that now.
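For reference, a minimal sketch of the two steps discussed above: counting the HTTP status codes returned for /g/ requests in a finished user's WARC, then excluding /g/ from the next mirror run. The paths, URL form, and wget flags are illustrative only; the actual dld-me-com.sh fix may filter these urls differently, and --warc-file needs a wget build with WARC support (wget-warc).

    #!/bin/bash
    # Count the HTTP status codes that came back for /g/ requests in a
    # finished user's WARC (user "timmyers" from the log above).
    gunzip -c data/t/ti/tim/timmyers/web.me.com/web.me.com-timmyers.warc.gz \
      | grep "GET /g/" -A 24 | grep "^HTTP" | cut -d " " -f 2 | sort | uniq -c

    # If everything comes back 302, skip /g/ on the next run so wget stops
    # hammering the YouNeedToSpecifyTheAccountName redirect target.
    # (--exclude-directories is plain wget; the URL form is an assumption.)
    wget --mirror --warc-file="web.me.com-timmyers" \
      --exclude-directories="/g" \
      "http://web.me.com/timmyers/"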
[09:37] that is the entire output of that command
[09:42] I've pushed an update to github.
[09:48] i was using single
[09:48] or does single call dld-me-com?
[09:48] I forget
[09:48] i should be in bed sleeping
[09:49] ah
[09:49] alright
[10:03] Yes, single calls dld-me-com.
[10:04] that seems to be doing a much better job of downloading stuff
[10:04] averaging 4Mbps
[10:07] well, i'll leave this be while i sleep
[10:07] alard: btw, SketchCow wanted to talk about output files
[10:08] Ah, okay, I saw something like that. Let's do that when he's back.
[10:09] (Although we've covered most of that before, I think.)
[10:29] looks like i need a new hd
[10:30] it's just clicking
[10:30] bad time to need a new hard drive
[10:31] i really fortunately have a spare
[10:34] like i never have extra computer crap but i have one that's the exact same size
[10:34] ffff
[10:44] haha
[14:17] Back
[14:17] More than the format, which we decided would be WARC, I mean the arrangement of people's files.
[14:17] Because we need to be able to pull in by-site stuff, and arrange.
[15:00] SketchCow: What do you suggest?
[15:02] Right now, the structure is data/u/us/use/username/domain (where domain is public.me.com, web.me.com, homepage.mac.com and gallery.me.com)
[15:02] Approved.
[15:02] That's all.
[15:03] And then we'll need to write something that ties these into a pretty bow
[15:14] Dilemma: should the script stop if wget returns a network error?
[15:14] Yes, of course, since that is not a complete download.
[15:15] But also no: it also returns a network error if it can't resolve a hostname. (A few users have links to domains that no longer exist.)
[15:24] Solution: remove the urls from user-owned domains (the content is always the same as the content on web.me.com anyway).
[17:38] huh, wow -- public.me.com actually uses HTTP 207
[17:38] I haven't seen that response in quite a while
[18:03] alard: in the future, perhaps the script(s) should report to the tracker what version was used to fetch an item, for record-keeping, in case of some major bugfix or something.
[18:17] https://www.google.com/search?q=do+a+barrel+roll
[19:02] Coderjoe: good idea, the scripts now include a version number.
[19:20] 31 users from me.com
[19:20] 64GB
[19:20] Dang!
[19:25] http://www.google.com/search?hl=en&q=askew
[20:03] underscor (and chronomex, Coderjoe): If you haven't yet done so today, please do a git pull and run ./fix-dld.sh to get the latest web.me.com data.
[20:07] alard: If I git pulled and started some new ones, is fix-dld smart enough to not redo those?
[20:07] Yes. It looks for the wget-discover.log, which only the good ones have.
[20:08] aha
[20:08] It doesn't do anything, it seems
[20:09] 0 8:09PM:abuie@abuie-dev:/2/mobileme-grab 9660 π ./fix-dld.sh underscor
[20:09] 0 8:09PM:abuie@abuie-dev:/2/mobileme-grab 9661 π
[20:12] Then you probably started with a good version.
[20:13] Oh okay, excellent
[20:13] Yeah, I started late last night
[20:13] The new version of the script may be somewhat faster than the version you started with.
[20:13] Yeah, I saw it's filtering out a bunch of 404s now
[20:14] Yes, and wget now won't download urls more than once.
[20:14] Ooh, nice
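A rough sketch of the check alard describes for ./fix-dld.sh: any user directory without a wget-discover.log was fetched by an old script version and gets redone. The directory depth, the log file's location, and the dld-me-com.sh argument order are guesses here, not the real script.

    #!/bin/bash
    # Hypothetical core of fix-dld.sh: redo users that lack wget-discover.log.
    downloader="$1"   # your alias, e.g. ./fix-dld.sh underscor

    # Assumed layout: data/u/us/use/username, i.e. four levels below data/.
    find data -mindepth 4 -maxdepth 4 -type d | while read -r userdir
    do
      username=$(basename "$userdir")
      if [ ! -e "$userdir/web.me.com/wget-discover.log" ]
      then
        echo "Redoing $username (no wget-discover.log found)"
        ./dld-me-com.sh "$downloader" "$username"   # argument order is a guess
      fi
    done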
[20:15] Whee, 79GB now
[20:15] Getting over 100mbps from akamai
[20:27] That's nice. For some reason my downloads stick at 1-2 MB/s.
[20:27] Too far away from the source, perhaps.
[20:36] Possibly
[20:37] Although you'd think they have edge nodes near you
[20:37] 2mb/s is SLOW!?
[20:39] Relatively speaking, yes. My connection can do 4MB/s.
[20:39] bsmith094: fuck yes it is
[20:40] WTH are you running on?!
[20:41] i have cable broadband, i've SEEN 4mb/s for about 500ms but i usually top out at 1.8
[20:46] I'm on FiOS for my house
[20:47] so I top out at 30mb/s
[20:47] But this is running on 10gigE
[21:03] Is the owner of 38.104.224.202 here?
[21:04] alard: http://pastebin.com/6jABAcxV
[21:05] my ec2 node is only doing 4Mbps with two downloaders running
[21:05] Coderjoe: Can you search data/h/hy/hyp/hyphotoad1024/web.me.com/wget.log for the error? Probably something with Cannot.
[21:06] several instances of wget-warc: unable to resolve host address `www.mslphotography.biz'
[21:07] Okay, then you're probably running an old version of the script.
[21:08] From one of the feeds on the site it gets urls from www.mslphotography.biz, probably the domain owned by that user. The domain no longer exists, so wget returns a network error, which is picked up by the script.
[21:08] currently running commit b5584db0f
[21:08] This happens with a few accounts I've seen so far. The updated version of dld-me-com.sh only downloads the web.me.com urls from the feeds.
[21:09] b5584db0f is a bit old.
[21:09] it killed the dld-client loop when it hit that error, too
[21:09] The latest version doesn't have this error and is also faster.
[21:10] i'm waiting for the other client to finish
[21:10] The dld-client looks at the wget exit code. If there's anything wrong (other than a 404 or a not-authenticated error) it stops.
[21:11] You could try updating while the old one is running. Download the new version, touch STOP, run a new client. The old client will stop, the new client will keep running.
[21:12] (The new version looks at the modification time of the STOP file.)
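The update procedure alard describes at [21:10] to [21:12] amounts to two checks: abort on any wget exit code other than a 404-style or authentication error, and keep looping only while there is no STOP file newer than the client's own start time. A rough sketch, assuming dld-single.sh passes wget's exit code through (the real dld-client.sh may differ):

    #!/bin/bash
    # Hypothetical client loop in the spirit of dld-client.sh.
    downloader="$1"
    start_time=$(date +%s)

    # Run until a STOP file appears that is newer than this client, so that
    # `touch STOP` only stops clients started before it was touched.
    while [ ! -e STOP ] || [ "$(stat -c %Y STOP)" -lt "$start_time" ]
    do
      ./dld-single.sh "$downloader"   # claim and download one user
      result=$?

      # wget exit codes: 0 = ok, 8 = server error responses such as 404,
      # 6 = authentication failure. Anything else (e.g. 4, a network error
      # like an unresolvable hostname) means a real problem, so stop.
      case $result in
        0|6|8) ;;
        *) echo "wget exited with code $result, stopping." ; exit 1 ;;
      esac
    done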
[21:12] Meanwhile, someone is sending silly data to the tracker.
[21:16] so i don't think it's my hd that died, so hopefully i'll be back online downloading
[21:16] when the new motherboard comes :/
[21:17] i was finally able to put some bandwidth and disk in and it went to shit :p
[22:13] Well, OK.
[22:14] ----------------------------------------------------------------
[22:14] How about THIS shizzle - AOL is going to turn off 650 Mailing Lists (LISTSERVs) in 30 days
[22:14] Let's track those things down and squirrel away those 200mb or whatever
[22:14] ----------------------------------------------------------------
[22:17] http://www.infoworld.com/d/cloud-computing/aol-discontinues-listserv-mailing-list-service-177939
[22:17] http://listserv.aol.com/
[22:17] Someone kindly WARC that bitch
[22:53] Hi, I'm a writer for the MIT Technology Review doing an article on Archive Team. Today I'm looking for Qwerty0, or anyone who has his email address, or anyone who wants to talk about Poetry.com. PM me if any of this applies. Thank you and /spam.
[22:57] Are you the guy who's been calling me?
[22:59] Nope. Also, not a guy.
[23:06] Oh, you're Jason. Hello! I was hoping to talk to one more person from Archive Team, and my editor thinks the Poetry.com story is nicely "colorful," so here I am, lurking :)
[23:23] OK.
[23:23] There was another person on this tear to interview me and I heard no good words.
[23:24] Of the people here, if it matters what my thoughts are, chronomex, underscor and a few others were involved in poetry.com
[23:24] He's not online right now, but BlueMax was the kid who got threatened.
[23:25] Thanks!
[23:25] all good to know
[23:27] hi
[23:27] * chronomex peeks in
[23:28] alard: okay. i'll kick that shit off when i get home, maybe 5 hours from now
[23:37] SketchCow: I have a wget-warc on it. They don't have a robots.txt, however.
[23:38] Yeah
[23:38] Issue is we need to get past the passwording thing.
[23:38] And some lists aren't listed.
[23:38] oh.
[23:39] well that is ass
[23:43] i have a password. now I just need to feed a cookie to wget
[23:45] HELLO
[23:45] ---------------------------------------------------------
[23:45] http://www.nature.com/scientificamerican/archive/index.html
[23:45] Just. Do. It.
[23:45] ---------------------------------------------------------
[23:47] SketchCow/Coderjoe: Email may also be a way in. You can subscribe and ask for email archives.
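On feeding a cookie to wget for the password-protected LISTSERV archives ([23:43]): one straightforward way is to log in with a browser, export the session cookies in Netscape cookies.txt format, and hand that file to wget. A minimal sketch; whether the archives are actually reachable this way is an assumption, and --warc-file again needs a WARC-capable wget.

    #!/bin/bash
    # Mirror listserv.aol.com using a browser session cookie.
    # cookies.txt is a Netscape-format export from a logged-in browser.
    wget --mirror --no-parent \
      --load-cookies cookies.txt \
      --warc-file="listserv.aol.com" \
      "http://listserv.aol.com/"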