[00:35] OK, so I should put out the word [00:35] Do we have a way for a newble who comes in for the first time to get the right info? [00:37] Let me know, we should probably make a #splinder. [00:38] #splinder made [01:52] First time archiver, gonna start out with a few Splinder grabs. [01:57] Do it! [01:57] Be a hero! [01:58] (i'm installing it now) [01:58] :) [01:59] hmm need to install gnutls-dev [02:05] done, running [02:19] Live stuff is brilliant. [02:20] Yeah, this dashboard is fantastic. [02:27] SketchCow: suprised it isn't generalized though [02:30] what's going on with EC2 pricing and availability? I haven't used it in a while and it seems like spot instances are about 10x more expensive than they were, and there are no "small" instances available at all, at least in US and EU. [02:37] oh, the lack of "small" instances was my own error [02:38] but the spot market is still whackola. [03:03] what's a good # of theads to run for splinder [03:03] for low cpu [03:04] closure: yes. I wrote splinder, that behavior was a conscious change from dld-client. I'm going to change it to match -client when I get home in 6 hours or so. [03:04] chronomex: What's the best way to find stuck threads with streamer? [03:04] hmm? [03:05] Like, I think some of these are in the try-fail loop. [03:06] how many threads do you run underscor [03:06] 290 right now [03:06] Adding another 160 [03:06] oh. [03:07] I wonder what I can get away with [03:08] it seems to have stalled at getting usernames [03:08] 29/100 Getting next username from tracker... downloading it:larimar07 [03:08] 30/100 Getting next username from tracker... [03:08] No more usernames available. Entering reap mode... [03:08] hmph. [03:10] I find that it works well with 50-100 threads per streamer, running multiple streamers at once [03:10] same directory is ok for multiple streamers? [03:11] for now, you can turn off that reap mode by changing "fork_more=0" near that message to "sleep 20" or something [03:11] same directory is just fine [03:11] isn't reap mode good? [03:12] reap mode is when it decides to quit gracefully [03:12] oh. [03:15] why is the tracker not filling me up with profiles to grab? [03:17] probably a transient connection problem [03:17] I get a bunch real fast, then it just stops. [03:19] any way to see individual stats in real time on the tracker page? [03:19] which is -fucking- cool by the way [03:20] sure, just get on top of the leaderboard [03:20] closure: MAYBE if you shut down [03:20] and ndurner and underscor [03:20] :D [03:21] ha [03:22] so, I recommend this patch to anyone who's getting retry stalls on us: http://pastebin.com/MnM1x899 [03:22] with that, it retries 5 times and then gives up and moves on [03:23] closure: do a pull request on github [03:29] ok I have a bunch of streamers running now, many of them successfully pulling profiles from the tracker [03:29] this is fantastic [03:30] I spun up a vm, followed the instructions on the wiki (I've never even used git before) and I was running in <10 mins [03:30] :) [03:30] fantastic [03:30] I think we'll be using derivatives of these scripts a lot [03:31] they're the best we've used yet .. thanks alard for getting it started :) [03:33] gotta head home. I'll be back in a bit. [03:33] cheers [04:19] I want to move the location of the scripts and data. Is it sufficient to mv the directory or is it more complicated than that [04:19] (after the scripts stop) [04:20] someone archived cranswick lachlan's page some time ago, where can i find it? [04:24] he was a nuclear physicist that died, wrote a lot about almost everything [04:39] going to hit 700k soon [04:42] Just got a "no more usernames available. Entering reap mode." message. Correct, or a tracker error? [04:42] tracker error [04:42] I was getting that a lot earlier [04:42] yeah [04:42] there's a patch for it [04:43] Open dld-streamer and change fork_more=0 to sleep 20 [04:43] Will do. thx. [04:44] 700k! [04:44] ndurner got it [04:45] :) [04:47] my god... the web tracker updates faster than my ssh session [04:48] ha [04:49] there's a bit of a delay in the streamer, as the script waits for the http call to return before quitting [04:50] ah [04:50] still funny :D [04:50] yep [04:57] how are incompletes handled? [05:03] ? [05:05] if the script exited ungracefully etc [05:05] use dld-single.sh to redownload them [05:06] how do I know which ones to get? the ones that are still sitting there in the root directory when all scripts are stopped? [05:06] look for the .incomplete files [05:06] ok. [05:07] it can't handle them automatically because it can't distinguish between one where the downloader was killed and one where the downloader is still running but isn't finished [05:08] hmm [05:08] maybe I should add a check for incompletes. [05:08] aren't all the incomplete ones just the ones that still have .logs in the spindler-grab/ ? [05:09] should also be a good indicator. [05:10] I don't trust that 100% tho [05:11] how come? [05:13] are the entries in downloads.logs only complete ones? [05:16] .log rather [05:18] someone gave me a patch to merge into the downloader to stop the infinite retry loop, how do i patch [05:18] this http://pastebin.com/MnM1x899 [05:22] I have about 700 users in 2.5 hours [05:22] not too bad [05:22] 280U/hr [05:23] 26 hours until I make it to the leaderboard assuming it doesn't change at all on the low end and nobody else enters it :P [05:26] how do i check how many users i have [05:26] also, has anyone else noticed that these user pages are, in total, very very small? [05:27] there's probably a better way but: cat -n downloads.log [05:27] bsmith094: yea [05:27] i have jpgs bigger thean any 4 of mine put together [05:28] another thing, who writes these scripts, every site is different, you'd need custom code for every job [05:29] someone with serious skills [05:29] yes, we generally do have different code for every job [05:30] still if it was me, and its not, i'd put my username in a comment or something [05:30] not really necessary [05:30] https://github.com/ArchiveTeam/splinder-grab/commits/master [05:32] still, some serious work was put into this, parallelizing these download jobs., creating a tracker, grabbing the list of users ( how'd they do that?) and then actually coding all that [05:32] yep [05:32] of course, we do stand on the shoulders of giants [05:33] k then my bad, alard, chronomex, et all good work :) [05:33] we didn't actually have to write wget, although we did extend it to create the warc files [05:33] what does the warc add on actually do, anyway [05:34] ok I know I have a bunch of incompletes in one splinder-grab directory. No more scripts are running there [05:34] I don't know how to remedy the incompletes [05:34] it makes wget store everything it downloads in a warc file that contains the request and response headers in addition to the data [05:35] and those are important, why, exactly [05:35] plus metadata about how and when the request was made [05:38] im running this in a terminal, can i safely run multiple instances? [05:38] yep [05:38] yes [05:40] im just opening multiple windows, then and running the script in each one ( sorry for the noob questions, but i dont really do much with the terminal except apt get upgrades) [05:40] yea, that works [05:40] that's about right. [05:40] you could also use screen or tmux [05:40] I've got a crazy screen session going, myself [05:41] what is screen\ [05:42] apt-get install screen [05:42] man screen [05:42] it's umm... [05:42] a thing [05:42] Screen is a full-screen window manager that multiplexes a physical ter- minal between several processes (typically interactive shells). [05:45] ive got 6 termials open [05:47] should it be downloading faster? [05:47] it's pretty slow [05:47] im using hardly any of any of my resources [05:47] closure and ndurner are cheating somehow!! [05:47] it doesn't help that we are hammering them [05:48] minimal bandwidth, minimal disk i/o, which is probably a good thing, min cpu, the fans arent even going on [05:49] yea, any individual downloader will spend most of it's time waiting [05:49] im just gonna open like 10 more terminal windows then [05:49] if you want, there's a script called dld-streamer.sh that can run dozens of downloaders for you [05:50] where [05:50] wow duh ok i found it [05:51] :) [05:51] hey whats the check for the latest code command [05:52] i know nothing about git [05:52] git pull [05:53] "git pull" from within splinder-grab [05:53] yep [05:53] git pull or just git pull [05:54] dld-streamer 30 sounds good [05:55] entering reap mode? [05:55] means the tracker didn't supply enough profile names [05:55] so it is gracefully exiting (downloading the users it DID get from the tracker) [05:55] so let that finish [05:55] so ittl get more when more become availible [05:55] no, it will exit when it finishes the list that it got [05:56] oh, it says yes it will, this is a very smart script [05:56] spits back pids and everything [06:02] I just hit 1,000 users [06:18] how many is closure running! [06:18] I think closure secretly owns splinder [06:18] has anyone gotten a profile thats over 2 mb yet ( in the last hour) [06:19] I'm not sure about myself [06:19] I'm averaging just a little under 1mb/user [06:19] I'm not giant, I'm only 6'0.5" ;) [06:19] bsmith094: not lately, but yes. [06:19] most of mine are smaller sthan most of the text files i have [06:22] heh [06:22] well it *is* gzip'd [06:25] it is? [06:25] 06:25:24 up 3:30, 21 users, load average: 21.27, 11.13, 6.12 [06:25] O_O [06:25] what he said, i didnt know wither [06:29] trying to keep my resources maxed out [06:29] cpu load varies dramatically, heh [06:29] 06:30:01 up 3:34, 22 users, load average: 2.97, 6.96, 5.74 [06:32] there's a stage that's cpu-intensive, it's usually short [06:32] so it'll go up and down [06:32] I suggest running it with nice and ionice. [06:32] I don't know what that means [06:32] nice ionice ./dld-streamer.sh you whatever [06:32] but I am perfectly content to max out the box I am running it on [06:33] what do those do? [06:37] man nice; man ionice [06:38] they still max it out, they just allow you to run other things at higher priority [06:40] ah [06:40] I read the intros in the man pages too [06:41] but the sole purpose of the vm I'm running this on is to run this. [06:41] so I think I don't need those [06:53] very well [06:54] :) [06:56] underscor: btw, your bot appears to have died [06:57] whoops [06:57] Coderjoe: THanks [07:04] how is it tracked what users get uploaded safely to batcave? [07:10] 710k [07:28] http://www.reddit.com/r/linux/comments/mi80x/give_me_that_one_command_you_wish_you_knew_years/c3182v9 [07:37] underscor: o_o mind blown [07:39] there's a ton of bash builtins I don't know jack about [07:39] i knew about ^Z and fg, but not bg and disown [07:39] I blame the bash manpage; it shoves them all into a section that's near the end [07:39] "BUGS [07:39] It's too big and too slow. [07:39] " [07:39] I agree [07:39] the bash manpage is too damn long [07:39] well, on the big part [07:40] I also didn't know that bash did associative arrays until I saw it in chronomex's dld-streamer [07:40] what? you didn't see it in my chunky script? [07:40] or just didn't look into the internals of chunky? [07:41] the latter [07:41] chunky was also a much longer program [07:42] Man, mbuffer is delicious too [07:42] associative arrays were a somewhat recent addition to bash [07:42] It's like pv, but does in-memory buffering to smooth io operations [07:42] Also works across the network [07:42] Well, of course it works across the network [07:42] underscor: probably somewhat derived or inspired by buffer [07:43] haha [07:44] i've used buffer quite a bit to see what speeds I was getting on a tarpipe from a remote host [07:45] I see [07:47] i'd use a tarpipe to transfer the files because the tar on the remote end would already know what files it needed to package, so there wouldn't be the pipeline stalls of my end telling the remote end what to do [07:47] yeah. tar is great to stream over a network. [07:48] (if something broke during the tar transfer, I would then use rsync) [07:48] rsync is pretty good at that too, but yes. [07:48] you could use find to generate a list of candidates, then rsync at the end to catch the little pieces of things [07:52] is it just me, or is us.splinder.com really getting killed [07:56] $ find data -mindepth 5 -maxdepth 5 -type d|cut -d / -f 2|uniq -c [07:56] 11 us [07:56] 14 it [07:56] not a big sample size, though [08:18] Oh, wow, we're not gonna finish splinder at the current rate [08:19] maybe if they got their fucking application servers in line [08:19] lol [08:19] I mean, really [08:20] - Downloading blog from gothic-pride.splinder.com... done, with HTTP errors. [08:20] - Checking for important 502, 504 errors... errors found. [08:20] I've got eight processes running US downloads and similar stuff has happened to all of htem [08:20] chronomex: What will happen if I kill a process? [08:20] (what will streamer do?) [08:22] underscor: I don't think adding more workers is going to speed it up [08:22] at least not for the US site [08:22] yipdw: Yeah, I think we're "maxing" it out [08:22] yipdw: cancel it and do a git pull [08:22] alard: can we get a way to only request accounts from it splinder [08:22] db48x: I did [08:23] Coderjoe just checked in a patch that makes it stop after 5 retries [08:23] 4 retries [08:29] db48x: yeah, I'm running with it now [08:29] it should at least permit better progress [08:31] yikes, these large EC2 instances are expensive [08:31] $0.34/hour [08:31] this one I'm running might have to shut down in a couple of days :P [08:31] well, wait, we have four days left to download (assuming the 24th is a real deadline) [08:32] fuck it, I'll just max it for the next four days [08:33] smrt [08:35] s-m-r-t, I mean s-m-a-r-t [08:48] hm, I got the "no more usernames available" message from the streamer, running from HEAD [08:48] are all three tracker instances hit that hard? [09:09] I only have one micro instance left running. I ran up a bill that is already larger than I would have liked with a spot m2.xlarge instance >_< [09:24] how much is the damage? [09:25] I think I blew through my monthly download limit in a week, I shall find out tomorrow if I'm over [09:25] Cameron_D: heh [09:26] comcast isn't very happy with me either [09:26] 3-4TB a month since I signed up [09:26] Hah, wow [09:26] I have a 500gb limit split in on/off peak [09:27] they give me a call every month and they are always surprised when I say that that's the amount I intend to use next month [09:28] damn. splinder is fucked as shit. why isn't my shit going. [09:28] heh, what is their fair use policy like? [09:28] maybe they got tired of us and are blocking IPs :P [09:28] nah [09:28] no, it's been itermitant [09:29] here we go [09:29] man, alard, every time I use this stuff you made I'm happy. [09:30] hrm [09:30] I can't load google read or google mail [09:30] too much internets! [09:31] alard: for the next rework of this, I suggest leaving the notification-of-done to the client/streamer. I'd like to detect all tracker failures and implement a backoff, but I'm not quite sure how to do that now [09:33] I mean, if you're expiring units then it's fail-secure. but I'd still like to :) [09:38] another improvement we should make is to wget [09:38] if wget could retry until it got a success instead of a failure, then we wouldn't have to delete the whole user and start again [09:39] barring that, we could retry each phase separately [09:39] hmm. [09:45] db48x: $91 for transfer and $53 for the instance [09:45] mmm [09:47] and the clock is still running on some stuff i stashed in s3. that's going to sit until next month, as the storage cost is less than the transfer cost would be. [09:47] ouch [09:49] is anyone still running mobileme clients? the graph looks a bit flat [09:50] all my systems went over to Splinder [09:50] yea, splinder is more urgent [09:53] splinder streamer updated to continue in face of tracker unavailable, git pull if you want [09:53] I do want [09:54] it's a simple stupid fix. [09:56] oh, I see [09:56] meh, it works [09:56] I highly doubt we're going to hit problems with the tracker actually being out of usernames [09:56] although, we're almost below 600,000 [09:57] yea [09:57] tracker unavailable as in not responding. [09:59] chronomex: is this your script? [09:59] dld-streamer.sh is mine. [09:59] look up bash's "wait" builtin (re comment starting at 108) [10:00] I took alard's dld-client and modified it [10:00] Coderjoe: I did. I can't wait for "next child to exit", so I spin around every second or so and check each one. [10:00] the comment isn't very well phrased. [10:01] also there's some asymmetry over how it handles the two events of concern [10:01] it starts a new client once per loop, but it reaps dead children an unlimited number of times per loop [10:02] partly that's because of how I did it, but partly it's because I think it's better design to allow the counter to sometimes go down faster than it can go up [10:02] you could have looked at how chunky works and modify that... [10:02] chunky? [10:02] i have a whole bunch of code there to keep x number of processes running and stuf [10:02] from the friendster project [10:02] hm. [10:03] ah, right. I wasn't paying attention to #archiveteam that month. [10:03] christ, the thing I don't participate in is the thing that everyone from the media fucking cares about [10:03] even a half-decent UI to change the number of children, report what children were running, and cleanly exit [10:03] I suppose I should read my irc logs. [10:03] hmmmm. [10:03] I'll see what I can pull into the streamer from that [10:04] I'd like to turn this tracker/streamer combo into something rapidly deployable for future projects [10:04] the tracker in particular makes me really want to win the leaderboard. [10:05] or, shit, even show up anywhere. [10:06] the media cares about the friendster project? [10:06] strangely enough [10:06] they don't give a shit about geocities, mostly, I'm not sure why [10:06] "but these guys archived friendster!!" [10:06] maybe if I had ever used friendster I might understand [10:07] geocities is too old for many to remember, perhaps [10:07] fuck man I'm 24 and I remember geocities. [10:07] I even have a goddamn book on it [10:07] but friendster was noteworthy for being the first big social networking site [10:08] hrm. [10:08] tbh I kinda didn't know about friendster more than as "something that people who write for wired magazine talk about" until #archiveteam hit it like ten tons of bricks [10:11] heh [10:11] on the splinder wiki page: "Downloading one user's data can take between 10 seconds and a few hours." [10:11] I've still got one going since the 13th [10:12] HMMMM. [10:12] it is, amazingly, making progress [10:12] excellent. [10:12] heh, cool [10:12] how big is it so far? [10:13] 837M [10:13] one blog [10:13] journal.splinder.com [10:13] cheeerist [10:13] oh [10:13] I've bitched about it before, but I'm amazed it's still going on [10:13] in a world [10:13] one blog [10:13] one man [10:14] can download it all [10:14] they should have just hosted Splinder on EC2 [10:14] heh [10:15] yeah! back when ec2 didn't exist and splinder did, they should have moved it to a nonexistent platform service instead of buying metal! [10:15] yes [10:15] time travel [10:15] well, you know. neutrinos. [10:15] solves all infrastructure problems [10:15] heh [10:15] frsrs, yipdw [10:15] it even makes Rails scale [10:16] doesn't quite work on mysql, though [10:16] heh [10:16] I've not used MySQL in years [10:17] how is it nowadays? [10:17] idfk. [10:17] I don't like it. [10:17] it's a decent key-value store masquerading as a shitty excuse for a relational database [10:17] actually, if time travel were possible we could integrate it directly into our cpus [10:17] long while ago, I installed Postgres and just haven't found a reason to hate it [10:18] if you want relational database, use postgres. if you want nonrelational datastore, there are loads of good choices. [10:18] relational datastore is very hard problem. [10:18] there's been a lot of study of circuits with time travel capabilities [10:18] that's why there are so many free nonrelational datastores ;) [10:19] time travel is probably impossible though, since it turns a turing machine into a hyperturing machine [10:21] that's got to be the strangest explanation for why time travel is impossible that I've heard in a while [10:22] heh [10:22] yipdw: It's not possible to ask the tracker for splinder-IT accounts only, though I could move the US accounts to a separate tracker, if that's useful. [10:23] you think that'd be good? [10:23] alard: it's not really a huge problem now that the retry logic has been fixed [10:23] hardly a fix, just kind of a shitty bandaid [10:23] the real problem I was hitting is that a bunch of workers would just get stalled out on 5xx errors [10:23] well [10:23] it's Good Enough [10:23] :P [10:23] not much one can do about application servers or reverse proxies crapping out [10:24] DOWNLOAD HARDER [10:24]