#archiveteam 2011-11-03,Thu


Time Nickname Message
00:03 🔗 alard Coderjoe: No, it doesn't. It does keep track of when you claimed it, so it's possible to release that later.
00:04 🔗 alard chronomex: You can pull while running. If you stop, you can run dld-single.sh to resume that user.
00:04 🔗 chronomex aye
00:04 🔗 chronomex I told it to stop, will pull when it's done
00:07 🔗 alard Stopping can take a while. I've got one user here that I killed yesterday, resumed this morning, and now it's still not finished.
00:07 🔗 chronomex yeah, I just killed it and resumed
00:07 🔗 * chronomex impatient
00:07 🔗 * chronomex about to leave
00:11 🔗 underscor alard: Is this fire-and-forget now?
00:12 🔗 chronomex looks like
00:12 🔗 alard underscor: Yes, if there are no more bugs.
00:12 🔗 underscor yay
00:14 🔗 alard I've been running it yesterday and today. Got 185 users done.
00:14 🔗 alard (worth 94GB)
00:14 🔗 underscor Dang
00:15 🔗 underscor How many are there total?
00:15 🔗 underscor Also, is it safe to run concurrent clients?
00:15 🔗 alard I've got a list of 300,000, I think.
00:15 🔗 chronomex jesus shit
00:15 🔗 underscor damn
00:16 🔗 alard I'm not sure if they're all this big. The list I've loaded now is from the wayback machine, the other list is from google.
00:16 🔗 alard Concurrent clients is okay. Although it somehow won't get faster than 1MB/s for me, perhaps Akamai is limiting the speed.
00:17 🔗 underscor - Running wget --mirror (at least 2174 files)...
00:17 🔗 underscor ooh
00:18 🔗 alard Heh. --- But I should really go to sleep now, I'll speak to you later.
00:18 🔗 underscor Adios
00:23 🔗 underscor I'm getting 20-80Mbps from akamai
00:27 🔗 underscor Already at 3G
00:27 🔗 underscor dang
00:50 🔗 underscor http://thechive.files.wordpress.com/2010/04/photo-subtitle-6.jpg
01:02 🔗 Coderjoe I'm appearing to get several seconds of 100-200kbit/s with an occasional burst up to 5-6Mbit/s
02:30 🔗 amerrykan demoscene irl: http://vimeo.com/31158841
02:43 🔗 SketchCow I'd really like to discuss the resulting files.
03:58 🔗 underscor SketchCow: Resulting foles from what?
03:58 🔗 underscor files*
03:59 🔗 underscor amerrykan: What are they doing?
03:59 🔗 underscor Are those birds?
04:03 🔗 BlueMax *tweet*
05:08 🔗 Coderjoe giorgio moroder's restoration of fritz lang's metropolis is coming out on dvd and bluray on the 15th
05:09 🔗 inv with or without soundtrack?
05:10 🔗 inv think it would be awesome at double speed and with the TRON soundtrack
05:10 🔗 inv :)
05:11 🔗 Coderjoe with moroder's soundtrack
09:15 🔗 Coderjoe mm
09:15 🔗 Coderjoe no wonder this is going so slow
09:15 🔗 Coderjoe wget keeps hammering the same url
09:15 🔗 Coderjoe https://www.me.com/YouNeedToSpecifyTheAccountName.html:
09:15 🔗 Coderjoe 2011-11-03 09:15:30 ERROR 404: Not Found.
09:16 🔗 ersi hah, oops :)
09:16 🔗 Coderjoe with an occasional good url in there
09:27 🔗 alard Coderjoe: https://www.me.com/YouNeedToSpecifyTheAccountName.html is me.com's way of saying Not Found.
09:30 🔗 Coderjoe $ grep 'https://www.me.com/YouNeedToSpecifyTheAccountName.html' wget.log | wc -l
09:30 🔗 Coderjoe 57183
09:30 🔗 Coderjoe getting more at a rate of 2 per second
09:31 🔗 Coderjoe and getting something else every 15 seconds or so
09:32 🔗 alard Could you run gunzip -c data/h/hu/hub/hubertyamada/web.me.com/web.me.com-hubertyamada.warc.gz | grep "GET /g/" -A 24 | grep HTTP
09:32 🔗 alard but then with the user you're currently getting?
09:33 🔗 alard I expect you'll get a long list with only HTTP 302 responses.
09:33 🔗 Coderjoe yser is timmyers
09:34 🔗 Coderjoe er, user
09:35 🔗 alard Okay, I'll try it in a moment.
09:36 🔗 alard I think the problem is that some pages on web.me.com have links to images, stylesheets, scripts on web.me.com/g/ . /g/ no longer exists and redirects to www.me.com/YouNeedetc
09:36 🔗 Coderjoe 57923 302
09:36 🔗 Coderjoe gunzip -c web.me.com-timmyers.warc.gz | grep "GET /g/" -A 24 | grep "^HTTP" | cut -d " " -f 2 | sort | uniq -c
09:37 🔗 alard Nothing else? Then I think it's safe to start ignoring the /g/ directory. I'm trying that now.
09:37 🔗 Coderjoe that is the entire output of that command
09:42 🔗 alard I've pushed an update to github.
09:48 🔗 Coderjoe i was using single
09:48 🔗 Coderjoe or does single call dld-me-com?
09:48 🔗 Coderjoe I forget
09:48 🔗 Coderjoe i should be in bed sleeping
09:49 🔗 Coderjoe ah
09:49 🔗 Coderjoe alright
10:03 🔗 alard Yes, single calls dld-me-com.
10:04 🔗 Coderjoe that seems to be doing a much better job of downloading stuff
10:04 🔗 Coderjoe averaging 4Mbps
10:07 🔗 Coderjoe well, i'll leave this be while i sleep
10:07 🔗 Coderjoe alard: btw, SketchCow wanted to talk about output files
10:08 🔗 alard Ah, okay, I saw something like that. Let's do that when he's back.
10:09 🔗 alard (Although we've covered most of that before, I think.)
10:29 🔗 RedType looks like i need a new hd
10:30 🔗 RedType its just clicking
10:30 🔗 dnova bad time to need a new hard drive
10:31 🔗 RedType i really fortunately have a spare
10:34 🔗 RedType like i never have extra computer crap but i have one that's the exact same size
10:34 🔗 RedType ffff
10:44 🔗 dnova haha
14:17 🔗 SketchCow Back
14:17 🔗 SketchCow More than the format, which we decided to be WARC, I mean the arrangement of people.
14:17 🔗 SketchCow Because we need to be able to pull in by-site stuff, and arrange.
15:00 🔗 alard SketchCow: What do you suggest?
15:02 🔗 alard Right now, the structure is data/u/us/use/username/domain (where domain is public.me.com, web.me.com, homepage.mac.com and gallery.me.com)
15:02 🔗 SketchCow Approved.
15:02 🔗 SketchCow That's all.
15:03 🔗 SketchCow And then we'll need to write something that ties these into a pretty bow
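As a sketch of the layout alard describes (data/u/us/use/username/domain), the path can be derived from the first one, two, and three characters of the username. The function name `user_dir` is hypothetical; the actual mobileme-grab scripts may build the path differently.

```shell
#!/bin/sh
# Hypothetical helper: build data/u/us/use/username/domain for a user.
user_dir() {
  user=$1
  domain=$2
  # Take the 1-, 2- and 3-character prefixes of the username.
  p1=$(printf '%s' "$user" | cut -c1)
  p2=$(printf '%s' "$user" | cut -c1-2)
  p3=$(printf '%s' "$user" | cut -c1-3)
  printf 'data/%s/%s/%s/%s/%s\n' "$p1" "$p2" "$p3" "$user" "$domain"
}

# prints data/h/hu/hub/hubertyamada/web.me.com
user_dir hubertyamada web.me.com
```

The example output matches the path used in the gunzip command earlier in the log, which suggests the prefix directories are plain leading substrings of the username.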
15:14 🔗 alard Dilemma: should the script stop if wget returns a network error?
15:14 🔗 alard Yes, of course, since that is not a complete download.
15:15 🔗 alard But also no: it also returns a network error if it can't resolve a hostname. (A few users have links to domains that no longer exist.)
15:24 🔗 alard Solution: remove the urls from user-owned domains (the content is always the same as the content on web.me.com anyway).
17:38 🔗 yipdw huh, wow -- public.me.com actually uses HTTP 207
17:38 🔗 yipdw I haven't seen that response in quite a while
18:03 🔗 Coderjoe alard: in the future, perhaps the script(s) should report to the tracker what version was used to fetch an item, for record-keeping, in the case of some major bugfix or something.
18:17 🔗 Coderjoe https://www.google.com/search?q=do+a+barrel+roll
19:02 🔗 alard Coderjoe: good idea, the scripts now include a version number.
19:20 🔗 underscor 31 users from me.com
19:20 🔗 underscor 64GB
19:20 🔗 underscor Dang!
19:25 🔗 underscor http://www.google.com/search?hl=en&q=askew
20:03 🔗 alard underscor (and chronomex, Coderjoe): If you haven't yet done so today, please do a git pull and run ./fix-dld.sh <yournick> to get the latest web.me.com data.
20:07 🔗 underscor alard: If I git pulled and started some new ones, is fix-dld smart enough to not redo those?
20:07 🔗 alard Yes. It looks for the wget-discover.log, which only the good ones have.
20:08 🔗 underscor aha
20:08 🔗 underscor It doesn't do anything, it seems
20:09 🔗 underscor 0 8:09PM:abuie@abuie-dev:/2/mobileme-grab 9660 π ./fix-dld.sh underscor
20:09 🔗 underscor 0 8:09PM:abuie@abuie-dev:/2/mobileme-grab 9661 π
20:12 🔗 alard Then you probably started with a good version.
20:13 🔗 underscor Oh okay, excellent
20:13 🔗 underscor Yeah, I started late last night
20:13 🔗 alard The new version of the script may be somewhat faster than the version you started with.
20:13 🔗 underscor Yeah, I saw it's filtering out a bunch of 404s now
20:14 🔗 alard Yes, and wget now won't download urls more than once.
20:14 🔗 underscor Ooh, nice
20:15 🔗 underscor Whee, 79GB now
20:15 🔗 underscor Getting over 100mbps from akamai
20:27 🔗 alard That's nice. For some reason my downloads stick at 1-2 MB/s.
20:27 🔗 alard Too far away from the source, perhaps.
20:36 🔗 underscor Possibly
20:37 🔗 underscor Although you'd think they have edge nodes near you
20:37 🔗 bsmith094 2mb/s is SLOW!?
20:39 🔗 alard Relatively speaking, yes. My connection can do 4MB/s.
20:39 🔗 ersi bsmith094: fuck yes it is
20:40 🔗 bsmith094 WTH are you running on?!
20:41 🔗 bsmith094 i have cable broadband ive SEEN 4mb/s for about 500ms but i usually top out at 1.8
20:46 🔗 underscor I'm on FiOS for my house
20:47 🔗 underscor so I top out at 30mb/s
20:47 🔗 underscor But this is running on 10gigE
21:03 🔗 alard Is the owner of 38.104.224.202 here?
21:04 🔗 Coderjoe alard: http://pastebin.com/6jABAcxV
21:05 🔗 Coderjoe my ec2 node is only doing 4Mbps with two downloaders running
21:05 🔗 alard Coderjoe: Can you search data/h/hy/hyp/hyphotoad1024/web.me.com/wget.log for the error? Probably something with Cannot.
21:06 🔗 Coderjoe several wget-warc: unable to resolve host address `www.mslphotography.biz'
21:07 🔗 alard Okay, then you're probably running an old version of the script.
21:08 🔗 alard From one of the feeds on the site it gets urls from www.mslphotography.biz, probably the domain owned by that user. The domain no longer exists, wget returns a network error which is picked up by the script.
21:08 🔗 Coderjoe currently running commit b5584db0f
21:08 🔗 alard This happens with a few accounts I've seen so far. The updated version of dld-me-com.sh only downloads from urls on web.me.com from the feeds.
21:09 🔗 alard b5584db0f is a bit old.
21:09 🔗 Coderjoe it killed the dld-client loop when it hit that error, too
21:09 🔗 alard The latest version doesn't have this error and is also faster.
21:10 🔗 Coderjoe i'm waiting for the other client to finish
21:10 🔗 alard The dld-client looks for the wget exit code. If there's anything wrong (not being a 404 or not authenticated error) it stops.
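A minimal sketch of the exit-code check alard describes, not the actual dld-client code. It assumes GNU wget's documented exit statuses: 4 for a network failure (which includes an unresolvable host), 6 for an authentication failure, and 8 for a server error response such as a 404. The function name `wget_exit_ok` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a wget exit status should let the
# client carry on (return 0) or make it stop (return 1).
wget_exit_ok() {
  case "$1" in
    0) return 0 ;;   # no problems
    6) return 0 ;;   # authentication failure: tolerated
    8) return 0 ;;   # server issued an error response (e.g. 404): tolerated
    *) return 1 ;;   # anything else, including 4 (network failure): stop
  esac
}

# Show the decision for a few representative exit statuses.
for code in 0 4 6 8; do
  if wget_exit_ok "$code"; then
    echo "exit $code: continue"
  else
    echo "exit $code: stop"
  fi
done
```

Under this rule an unresolvable hostname such as the one Coderjoe hit (exit status 4) stops the client, which is why the updated script avoids fetching from user-owned domains in the first place.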
21:11 🔗 alard You could try updating while the old one is running. Download the new version, touch STOP, run a new client. The old client will stop, the new client will keep running.
21:12 🔗 alard (The new version looks at the modification time of the STOP file.)
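One way the STOP-file check could work is sketched below, assuming each client records a marker file when it starts and compares it against STOP with the shell's `-nt` (newer than) test. The function name `should_stop` and the marker files are hypothetical; the log does not show how the real script stores its start time.

```shell
#!/bin/sh
# Hypothetical helper: succeed (return 0) when the STOP file exists and was
# modified more recently than the marker file this client created at startup.
should_stop() {
  stop_file=$1
  start_marker=$2
  [ -f "$stop_file" ] && [ "$stop_file" -nt "$start_marker" ]
}

# Demo with explicit timestamps (touch -t takes [[CC]YY]MMDDhhmm).
demo=$(mktemp -d)
touch -t 201111032000 "$demo/old-client.started"   # started before STOP
touch -t 201111032111 "$demo/STOP"                 # STOP touched at 21:11
touch -t 201111032112 "$demo/new-client.started"   # started after STOP

should_stop "$demo/STOP" "$demo/old-client.started" && echo "old client stops"
should_stop "$demo/STOP" "$demo/new-client.started" || echo "new client keeps running"
rm -rf "$demo"
```

Comparing modification times rather than checking only for the file's existence is what lets a freshly started client ignore a STOP file left behind for an older one.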
21:12 🔗 alard Meanwhile, someone is sending silly data to the tracker.
21:16 🔗 RedType so i dont think it's my hd that died so hopefully i'll be back online downloading
21:16 🔗 RedType when the new motherboard comes :/
21:17 🔗 RedType i was finally able to put some bandwidth and disk in and it went to shit :p
22:13 🔗 SketchCow Well, OK.
22:14 🔗 SketchCow ----------------------------------------------------------------
22:14 🔗 SketchCow How about THIS shizzle - AOL is going to turn off 650 Mailing Lists (LISTSERVS) in 30 days
22:14 🔗 SketchCow Let's track those things down and squirrel away those 200mb or whatever
22:14 🔗 SketchCow ----------------------------------------------------------------
22:17 🔗 SketchCow http://www.infoworld.com/d/cloud-computing/aol-discontinues-listserv-mailing-list-service-177939
22:17 🔗 SketchCow http://listserv.aol.com/
22:17 🔗 SketchCow Someone kindly WARC that bitch
22:53 🔗 frogzilla Hi, I'm a writer for the MIT Technology Review doing an article on Archive Team. Today I'm looking for Qwerty0, or anyone who has his email address, or anyone who wants to talk about Poetry.com. PM me if any of this applies. Thank you and /spam.
22:57 🔗 SketchCow Are you the guy's been calling me?
22:59 🔗 frogzilla Nope. Also, not a guy.
23:06 🔗 frogzilla Oh, you're Jason. Hello! I was hoping to talk to one more person from Archive Team, and my editor thinks the Poetry.com story is nicely "colorful," so here I am, lurking :)
23:23 🔗 SketchCow OK.
23:23 🔗 SketchCow There was another person on this tear to interview me and I heard no good words.
23:24 🔗 SketchCow Of the people here, if it matters what my thoughts are, chronomex, underscor and a few others were involved in poetry.com
23:24 🔗 SketchCow He's not online right now, but BlueMax was the kid who got threatened.
23:25 🔗 frogzilla Thanks!
23:25 🔗 frogzilla all good to know
23:27 🔗 chronomex hi
23:27 🔗 * chronomex peeks in
23:28 🔗 chronomex alard: okay. ill kick that shit off when i get home, maybe 5 hours from now
23:37 🔗 Coderjoe SketchCow: I have a wget-warc on it. they don't have a robots.txt however.
23:38 🔗 SketchCow Yeah
23:38 🔗 SketchCow Issue is we need to get past the passwording thing.
23:38 🔗 SketchCow And some lists aren't listed.
23:38 🔗 Coderjoe oh.
23:39 🔗 Coderjoe well that is ass
23:43 🔗 Coderjoe i have a password. now I just need to feed a cookie to wget
23:45 🔗 SketchCow HELLO
23:45 🔗 SketchCow ---------------------------------------------------------
23:45 🔗 SketchCow http://www.nature.com/scientificamerican/archive/index.html
23:45 🔗 SketchCow Just. Do. It.
23:45 🔗 SketchCow ---------------------------------------------------------
23:47 🔗 alard SketchCow/Coderjoe: Email may also be a way in. You can subscribe and ask for email archives.
