[00:35] OK, so I should put out the word [00:35] Do we have a way for a newbie who comes in for the first time to get the right info? [00:37] Let me know, we should probably make a #splinder. [00:38] #splinder made [01:52] First time archiver, gonna start out with a few Splinder grabs. [01:57] Do it! [01:57] Be a hero! [01:58] (i'm installing it now) [01:58] :) [01:59] hmm need to install gnutls-dev [02:05] done, running [02:19] Live stuff is brilliant. [02:20] Yeah, this dashboard is fantastic. [02:27] SketchCow: surprised it isn't generalized though [02:30] what's going on with EC2 pricing and availability? I haven't used it in a while and it seems like spot instances are about 10x more expensive than they were, and there are no "small" instances available at all, at least in US and EU. [02:37] oh, the lack of "small" instances was my own error [02:38] but the spot market is still whackola. [03:03] what's a good # of threads to run for splinder [03:03] for low cpu [03:04] closure: yes. I wrote splinder, that behavior was a conscious change from dld-client. I'm going to change it to match -client when I get home in 6 hours or so. [03:04] chronomex: What's the best way to find stuck threads with streamer? [03:04] hmm? [03:05] Like, I think some of these are in the try-fail loop. [03:06] how many threads do you run underscor [03:06] 290 right now [03:06] Adding another 160 [03:06] oh. [03:07] I wonder what I can get away with [03:08] it seems to have stalled at getting usernames [03:08] 29/100 Getting next username from tracker... downloading it:larimar07 [03:08] 30/100 Getting next username from tracker... [03:08] No more usernames available. Entering reap mode... [03:08] hmph. [03:10] I find that it works well with 50-100 threads per streamer, running multiple streamers at once [03:10] same directory is ok for multiple streamers?
[03:11] for now, you can turn off that reap mode by changing "fork_more=0" near that message to "sleep 20" or something [03:11] same directory is just fine [03:11] isn't reap mode good? [03:12] reap mode is when it decides to quit gracefully [03:12] oh. [03:15] why is the tracker not filling me up with profiles to grab? [03:17] probably a transient connection problem [03:17] I get a bunch real fast, then it just stops. [03:19] any way to see individual stats in real time on the tracker page? [03:19] which is -fucking- cool by the way [03:20] sure, just get on top of the leaderboard [03:20] closure: MAYBE if you shut down [03:20] and ndurner and underscor [03:20] :D [03:21] ha [03:22] so, I recommend this patch to anyone who's getting retry stalls on us: http://pastebin.com/MnM1x899 [03:22] with that, it retries 5 times and then gives up and moves on [03:23] closure: do a pull request on github [03:29] ok I have a bunch of streamers running now, many of them successfully pulling profiles from the tracker [03:29] this is fantastic [03:30] I spun up a vm, followed the instructions on the wiki (I've never even used git before) and I was running in <10 mins [03:30] :) [03:30] fantastic [03:30] I think we'll be using derivatives of these scripts a lot [03:31] they're the best we've used yet .. thanks alard for getting it started :) [03:33] gotta head home. I'll be back in a bit. [03:33] cheers [04:19] I want to move the location of the scripts and data. Is it sufficient to mv the directory or is it more complicated than that [04:19] (after the scripts stop) [04:20] someone archived cranswick lachlan's page some time ago, where can i find it? [04:24] he was a nuclear physicist that died, wrote a lot about almost everything [04:39] going to hit 700k soon [04:42] Just got a "no more usernames available. Entering reap mode." message. Correct, or a tracker error? 
[04:42] tracker error [04:42] I was getting that a lot earlier [04:42] yeah [04:42] there's a patch for it [04:43] Open dld-streamer and change fork_more=0 to sleep 20 [04:43] Will do. thx. [04:44] 700k! [04:44] ndurner got it [04:45] :) [04:47] my god... the web tracker updates faster than my ssh session [04:48] ha [04:49] there's a bit of a delay in the streamer, as the script waits for the http call to return before quitting [04:50] ah [04:50] still funny :D [04:50] yep [04:57] how are incompletes handled? [05:03] ? [05:05] if the script exited ungracefully etc [05:05] use dld-single.sh to redownload them [05:06] how do I know which ones to get? the ones that are still sitting there in the root directory when all scripts are stopped? [05:06] look for the .incomplete files [05:06] ok. [05:07] it can't handle them automatically because it can't distinguish between one where the downloader was killed and one where the downloader is still running but isn't finished [05:08] hmm [05:08] maybe I should add a check for incompletes. [05:08] aren't all the incomplete ones just the ones that still have .logs in the splinder-grab/ ? [05:09] should also be a good indicator. [05:10] I don't trust that 100% tho [05:11] how come? [05:13] are the entries in downloads.logs only complete ones? [05:16] .log rather [05:18] someone gave me a patch to merge into the downloader to stop the infinite retry loop, how do i patch [05:18] this http://pastebin.com/MnM1x899 [05:22] I have about 700 users in 2.5 hours [05:22] not too bad [05:22] 280U/hr [05:23] 26 hours until I make it to the leaderboard assuming it doesn't change at all on the low end and nobody else enters it :P [05:26] how do i check how many users i have [05:26] also, has anyone else noticed that these user pages are, in total, very very small?
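The "change fork_more=0 to sleep 20" workaround discussed above, as a one-liner. This demo runs against a stand-in file, since dld-streamer.sh itself isn't included here; on a real checkout you'd point the same sed command at dld-streamer.sh (and note there may be more than one fork_more=0 in the script, so check which one you're hitting):

```shell
# Stand-in for the real dld-streamer.sh (the excerpt below is hypothetical):
cat > dld-streamer-demo.sh <<'EOF'
# give up when the tracker has no more usernames
fork_more=0
EOF
# Sleep and retry instead of entering reap mode:
sed -i 's/fork_more=0/sleep 20/' dld-streamer-demo.sh
grep -n 'sleep 20' dld-streamer-demo.sh
```

This edit just keeps the worker alive until the tracker hands out names again; the later git commits made the same behavior official.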
[05:27] there's probably a better way but: cat -n downloads.log [05:27] bsmith094: yea [05:27] i have jpgs bigger than any 4 of mine put together [05:28] another thing, who writes these scripts, every site is different, you'd need custom code for every job [05:29] someone with serious skills [05:29] yes, we generally do have different code for every job [05:30] still if it was me, and its not, i'd put my username in a comment or something [05:30] not really necessary [05:30] https://github.com/ArchiveTeam/splinder-grab/commits/master [05:32] still, some serious work was put into this, parallelizing these download jobs, creating a tracker, grabbing the list of users (how'd they do that?) and then actually coding all that [05:32] yep [05:32] of course, we do stand on the shoulders of giants [05:33] k then my bad, alard, chronomex, et al. good work :) [05:33] we didn't actually have to write wget, although we did extend it to create the warc files [05:33] what does the warc add on actually do, anyway [05:34] ok I know I have a bunch of incompletes in one splinder-grab directory. No more scripts are running there [05:34] I don't know how to remedy the incompletes [05:34] it makes wget store everything it downloads in a warc file that contains the request and response headers in addition to the data [05:35] and those are important, why, exactly [05:35] plus metadata about how and when the request was made [05:38] im running this in a terminal, can i safely run multiple instances? [05:38] yep [05:38] yes [05:40] im just opening multiple windows, then, and running the script in each one (sorry for the noob questions, but i dont really do much with the terminal except apt-get upgrades) [05:40] yea, that works [05:40] that's about right. [05:40] you could also use screen or tmux [05:40] I've got a crazy screen session going, myself [05:41] what is screen? [05:42] apt-get install screen [05:42] man screen [05:42] it's umm...
[05:42] a thing [05:42] Screen is a full-screen window manager that multiplexes a physical terminal between several processes (typically interactive shells). [05:45] ive got 6 terminals open [05:47] should it be downloading faster? [05:47] it's pretty slow [05:47] im using hardly any of any of my resources [05:47] closure and ndurner are cheating somehow!! [05:47] it doesn't help that we are hammering them [05:48] minimal bandwidth, minimal disk i/o, which is probably a good thing, min cpu, the fans aren't even going on [05:49] yea, any individual downloader will spend most of its time waiting [05:49] im just gonna open like 10 more terminal windows then [05:49] if you want, there's a script called dld-streamer.sh that can run dozens of downloaders for you [05:50] where [05:50] wow duh ok i found it [05:51] :) [05:51] hey what's the command to check for the latest code [05:52] i know nothing about git [05:52] git pull [05:53] "git pull" from within splinder-grab [05:53] yep [05:53] git pull or just git pull [05:54] dld-streamer 30 sounds good [05:55] entering reap mode? [05:55] means the tracker didn't supply enough profile names [05:55] so it is gracefully exiting (downloading the users it DID get from the tracker) [05:55] so let that finish [05:55] so it'll get more when more become available [05:55] no, it will exit when it finishes the list that it got [05:56] oh, it says yes it will, this is a very smart script [05:56] spits back pids and everything [06:02] I just hit 1,000 users [06:18] how many is closure running! [06:18] I think closure secretly owns splinder [06:18] has anyone gotten a profile thats over 2 mb yet ( in the last hour) [06:19] I'm not sure about myself [06:19] I'm averaging just a little under 1mb/user [06:19] I'm not giant, I'm only 6'0.5" ;) [06:19] bsmith094: not lately, but yes. [06:19] most of mine are smaller than most of the text files i have [06:22] heh [06:22] well it *is* gzip'd [06:25] it is?
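Backing up to the recurring "how do i check how many users i have" question: a couple of one-liners, run against demo data here since the real working directory isn't included. The one-line-per-completed-user downloads.log and the .incomplete marker files follow the layout described in the chat; the exact data/ subdirectory depth is an assumption:

```shell
# Demo stand-in for a splinder-grab working directory:
printf 'it:user1\nit:user2\nus:user3\n' > downloads.log
mkdir -p data/us/u/us/user4
touch data/us/u/us/user4/.incomplete

wc -l < downloads.log          # count of completed users
find data -name '.incomplete'  # users to redo with dld-single.sh
```

`wc -l` gives the same count as the `cat -n downloads.log` trick above without printing every line.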
[06:25] 06:25:24 up 3:30, 21 users, load average: 21.27, 11.13, 6.12 [06:25] O_O [06:25] what he said, i didn't know either [06:29] trying to keep my resources maxed out [06:29] cpu load varies dramatically, heh [06:29] 06:30:01 up 3:34, 22 users, load average: 2.97, 6.96, 5.74 [06:32] there's a stage that's cpu-intensive, it's usually short [06:32] so it'll go up and down [06:32] I suggest running it with nice and ionice. [06:32] I don't know what that means [06:32] nice ionice ./dld-streamer.sh you whatever [06:32] but I am perfectly content to max out the box I am running it on [06:33] what do those do? [06:37] man nice; man ionice [06:38] they still max it out, they just allow you to run other things at higher priority [06:40] ah [06:40] I read the intros in the man pages too [06:41] but the sole purpose of the vm I'm running this on is to run this. [06:41] so I think I don't need those [06:53] very well [06:54] :) [06:56] underscor: btw, your bot appears to have died [06:57] whoops [06:57] Coderjoe: Thanks [07:04] how is it tracked what users get uploaded safely to batcave? [07:10] 710k [07:28] http://www.reddit.com/r/linux/comments/mi80x/give_me_that_one_command_you_wish_you_knew_years/c3182v9 [07:37] underscor: o_o mind blown [07:39] there's a ton of bash builtins I don't know jack about [07:39] i knew about ^Z and fg, but not bg and disown [07:39] I blame the bash manpage; it shoves them all into a section that's near the end [07:39] "BUGS [07:39] It's too big and too slow. [07:39] " [07:39] I agree [07:39] the bash manpage is too damn long [07:39] well, on the big part [07:40] I also didn't know that bash did associative arrays until I saw it in chronomex's dld-streamer [07:40] what? you didn't see it in my chunky script? [07:40] or just didn't look into the internals of chunky?
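The nice/ionice suggestion above, spelled out. The streamer invocation in the comment is an assumption based on the chat; the runnable line just demonstrates what nice actually does to a child process:

```shell
# For the real thing (arguments assumed from the chat):
#   nice -n 19 ionice -c 3 ./dld-streamer.sh yournick 100
# nice lowers CPU scheduling priority; ionice class 3 ("idle") makes the
# kernel service the process's disk I/O only when the disk is otherwise idle.
# Demo: a child started under nice -n 19 reports niceness 19.
nice -n 19 sh -c 'echo "child niceness: $(nice)"'   # prints: child niceness: 19
```

As noted in the chat, neither stops the downloader from using all spare capacity; they only let everything else on the box win when there's contention.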
[07:41] the latter [07:41] chunky was also a much longer program [07:42] Man, mbuffer is delicious too [07:42] associative arrays were a somewhat recent addition to bash [07:42] It's like pv, but does in-memory buffering to smooth io operations [07:42] Also works across the network [07:42] Well, of course it works across the network [07:42] underscor: probably somewhat derived or inspired by buffer [07:43] haha [07:44] i've used buffer quite a bit to see what speeds I was getting on a tarpipe from a remote host [07:45] I see [07:47] i'd use a tarpipe to transfer the files because the tar on the remote end would already know what files it needed to package, so there wouldn't be the pipeline stalls of my end telling the remote end what to do [07:47] yeah. tar is great to stream over a network. [07:48] (if something broke during the tar transfer, I would then use rsync) [07:48] rsync is pretty good at that too, but yes. [07:48] you could use find to generate a list of candidates, then rsync at the end to catch the little pieces of things [07:52] is it just me, or is us.splinder.com really getting killed [07:56] $ find data -mindepth 5 -maxdepth 5 -type d|cut -d / -f 2|uniq -c [07:56] 11 us [07:56] 14 it [07:56] not a big sample size, though [08:18] Oh, wow, we're not gonna finish splinder at the current rate [08:19] maybe if they got their fucking application servers in line [08:19] lol [08:19] I mean, really [08:20] - Downloading blog from gothic-pride.splinder.com... done, with HTTP errors. [08:20] - Checking for important 502, 504 errors... errors found. [08:20] I've got eight processes running US downloads and similar stuff has happened to all of them [08:20] chronomex: What will happen if I kill a process? [08:20] (what will streamer do?)
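The tarpipe idea above, sketched locally. The ssh form is shown in a comment with placeholder host and paths; the runnable part pipes one tar into another on the same machine, which exercises the same mechanism:

```shell
# Network form (host/paths are placeholders):
#   tar -C /src -cf - . | ssh host 'tar -C /dst -xf -'
# The sending tar decides what to package, so there are no per-file
# round-trips; rsync afterwards catches anything a broken transfer missed.
mkdir -p src dst
echo hello > src/a.txt
tar -C src -cf - . | tar -C dst -xf -
cat dst/a.txt   # prints: hello
```

This is why tar streams so well over high-latency links, and why the chat's fallback plan was "tarpipe first, rsync to clean up".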
[08:22] underscor: I don't think adding more workers is going to speed it up [08:22] at least not for the US site [08:22] yipdw: Yeah, I think we're "maxing" it out [08:22] yipdw: cancel it and do a git pull [08:22] alard: can we get a way to only request accounts from it.splinder [08:22] db48x: I did [08:23] Coderjoe just checked in a patch that makes it stop after 5 retries [08:23] 4 retries [08:29] db48x: yeah, I'm running with it now [08:29] it should at least permit better progress [08:31] yikes, these large EC2 instances are expensive [08:31] $0.34/hour [08:31] this one I'm running might have to shut down in a couple of days :P [08:31] well, wait, we have four days left to download (assuming the 24th is a real deadline) [08:32] fuck it, I'll just max it for the next four days [08:33] smrt [08:35] s-m-r-t, I mean s-m-a-r-t [08:48] hm, I got the "no more usernames available" message from the streamer, running from HEAD [08:48] are all three tracker instances hit that hard? [09:09] I only have one micro instance left running. I ran up a bill that is already larger than I would have liked with a spot m2.xlarge instance >_< [09:24] how much is the damage? [09:25] I think I blew through my monthly download limit in a week, I shall find out tomorrow if I'm over [09:25] Cameron_D: heh [09:26] comcast isn't very happy with me either [09:26] 3-4TB a month since I signed up [09:26] Hah, wow [09:26] I have a 500gb limit split in on/off peak [09:27] they give me a call every month and they are always surprised when I say that that's the amount I intend to use next month [09:28] damn. splinder is fucked as shit. why isn't my shit going. [09:28] heh, what is their fair use policy like? [09:28] maybe they got tired of us and are blocking IPs :P [09:28] nah [09:28] no, it's been intermittent [09:29] here we go [09:29] man, alard, every time I use this stuff you made I'm happy. [09:30] hrm [09:30] I can't load google reader or google mail [09:30] too much internets!
[09:31] alard: for the next rework of this, I suggest leaving the notification-of-done to the client/streamer. I'd like to detect all tracker failures and implement a backoff, but I'm not quite sure how to do that now [09:33] I mean, if you're expiring units then it's fail-secure. but I'd still like to :) [09:38] another improvement we should make is to wget [09:38] if wget could retry until it got a success instead of a failure, then we wouldn't have to delete the whole user and start again [09:39] barring that, we could retry each phase separately [09:39] hmm. [09:45] db48x: $91 for transfer and $53 for the instance [09:45] mmm [09:47] and the clock is still running on some stuff i stashed in s3. that's going to sit until next month, as the storage cost is less than the transfer cost would be. [09:47] ouch [09:49] is anyone still running mobileme clients? the graph looks a bit flat [09:50] all my systems went over to Splinder [09:50] yea, splinder is more urgent [09:53] splinder streamer updated to continue in face of tracker unavailable, git pull if you want [09:53] I do want [09:54] it's a simple stupid fix. [09:56] oh, I see [09:56] meh, it works [09:56] I highly doubt we're going to hit problems with the tracker actually being out of usernames [09:56] although, we're almost below 600,000 [09:57] yea [09:57] tracker unavailable as in not responding. [09:59] chronomex: is this your script? [09:59] dld-streamer.sh is mine. [09:59] look up bash's "wait" builtin (re comment starting at 108) [10:00] I took alard's dld-client and modified it [10:00] Coderjoe: I did. I can't wait for "next child to exit", so I spin around every second or so and check each one. [10:00] the comment isn't very well phrased. 
[10:01] also there's some asymmetry over how it handles the two events of concern [10:01] it starts a new client once per loop, but it reaps dead children an unlimited number of times per loop [10:02] partly that's because of how I did it, but partly it's because I think it's better design to allow the counter to sometimes go down faster than it can go up [10:02] you could have looked at how chunky works and modify that... [10:02] chunky? [10:02] i have a whole bunch of code there to keep x number of processes running and stuff [10:02] from the friendster project [10:02] hm. [10:03] ah, right. I wasn't paying attention to #archiveteam that month. [10:03] christ, the thing I don't participate in is the thing that everyone from the media fucking cares about [10:03] even a half-decent UI to change the number of children, report what children were running, and cleanly exit [10:03] I suppose I should read my irc logs. [10:03] hmmmm. [10:03] I'll see what I can pull into the streamer from that [10:04] I'd like to turn this tracker/streamer combo into something rapidly deployable for future projects [10:04] the tracker in particular makes me really want to win the leaderboard. [10:05] or, shit, even show up anywhere. [10:06] the media cares about the friendster project? [10:06] strangely enough [10:06] they don't give a shit about geocities, mostly, I'm not sure why [10:06] "but these guys archived friendster!!" [10:06] maybe if I had ever used friendster I might understand [10:07] geocities is too old for many to remember, perhaps [10:07] fuck man I'm 24 and I remember geocities. [10:07] I even have a goddamn book on it [10:07] but friendster was noteworthy for being the first big social networking site [10:08] hrm.
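A minimal sketch (not the real dld-streamer.sh or chunky) of the polling pattern chronomex describes above: children tracked in a bash associative array, every finished child reaped on each pass, but at most one new child started per pass, so the worker count can drop faster than it rises. The `sleep` children and the counts are stand-ins:

```shell
#!/bin/bash
declare -A pids          # pid -> 1 for each live child
total=5 max=3 started=0
while (( started < total || ${#pids[@]} > 0 )); do
  # reap every finished child this pass (kill -0 probes whether it's alive)
  for pid in "${!pids[@]}"; do
    if ! kill -0 "$pid" 2>/dev/null; then
      wait "$pid"
      unset "pids[$pid]"
    fi
  done
  # start at most one new child per pass, up to the concurrency cap
  if (( started < total && ${#pids[@]} < max )); then
    sleep 0.2 &          # stand-in for one downloader process
    pids[$!]=1
    started=$((started + 1))
  fi
  sleep 0.1
done
echo "reaped all $started children"
```

The once-a-second spin is needed because, as noted above, bash's `wait` at the time couldn't block on "whichever child exits next" while leaving the others running.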
[10:08] tbh I kinda didn't know about friendster more than as "something that people who write for wired magazine talk about" until #archiveteam hit it like ten tons of bricks [10:11] heh [10:11] on the splinder wiki page: "Downloading one user's data can take between 10 seconds and a few hours." [10:11] I've still got one going since the 13th [10:12] HMMMM. [10:12] it is, amazingly, making progress [10:12] excellent. [10:12] heh, cool [10:12] how big is it so far? [10:13] 837M [10:13] one blog [10:13] journal.splinder.com [10:13] cheeerist [10:13] oh [10:13] I've bitched about it before, but I'm amazed it's still going on [10:13] in a world [10:13] one blog [10:13] one man [10:14] can download it all [10:14] they should have just hosted Splinder on EC2 [10:14] heh [10:15] yeah! back when ec2 didn't exist and splinder did, they should have moved it to a nonexistent platform service instead of buying metal! [10:15] yes [10:15] time travel [10:15] well, you know. neutrinos. [10:15] solves all infrastructure problems [10:15] heh [10:15] frsrs, yipdw [10:15] it even makes Rails scale [10:16] doesn't quite work on mysql, though [10:16] heh [10:16] I've not used MySQL in years [10:17] how is it nowadays? [10:17] idfk. [10:17] I don't like it. [10:17] it's a decent key-value store masquerading as a shitty excuse for a relational database [10:17] actually, if time travel were possible we could integrate it directly into our cpus [10:17] long while ago, I installed Postgres and just haven't found a reason to hate it [10:18] if you want relational database, use postgres. if you want nonrelational datastore, there are loads of good choices. [10:18] relational datastore is very hard problem. 
[10:18] there's been a lot of study of circuits with time travel capabilities [10:18] that's why there are so many free nonrelational datastores ;) [10:19] time travel is probably impossible though, since it turns a turing machine into a hyperturing machine [10:21] that's got to be the strangest explanation for why time travel is impossible that I've heard in a while [10:22] heh [10:22] yipdw: It's not possible to ask the tracker for splinder-IT accounts only, though I could move the US accounts to a separate tracker, if that's useful. [10:23] you think that'd be good? [10:23] alard: it's not really a huge problem now that the retry logic has been fixed [10:23] hardly a fix, just kind of a shitty bandaid [10:23] the real problem I was hitting is that a bunch of workers would just get stalled out on 5xx errors [10:23] well [10:23] it's Good Enough [10:23] :P [10:23] not much one can do about application servers or reverse proxies crapping out [10:24] DOWNLOAD HARDER [10:24] wow [10:24] 420 chill out man [10:24] my data folder is 28gb in size [10:24] 420 should be an HTTP client error [10:24] the tracker reports 12gb for me [10:24] yipdw: yeah, it is server telling client to chill out [10:24] oh ha [10:24] it is [10:24] well, used by Twitter, anyway [10:25] 555 server used in movies only [10:26] aww [10:26] http://tools.ietf.org/html/draft-nottingham-http-new-status-02 [10:26] Cameron_D: the tracker only counts the size of the warc files, not the size of the logs that are also on your disk, or the overhead caused by small files [10:26] that proposes HTTP 429 for use in rate-limiting schemes [10:27] a lot of that 28 gigs is the overhead for the list of blog urls and list of media urls for each user [10:28] ah [10:28] a few hundred bytes stored in a 4k inode [10:28] yipdw: I hate postgres because of the fucked up permissions stuff. 
[10:28] du --apparent-size will tell you the file size totals, not the amount of disk usage [10:37] Coderjoe: I haven't messed much with the permissions on PostgreSQL objects [10:38] there is a whole set of problems with it, imo. [10:38] though newer versions appear to have addressed one problem. (default permissions on new objects) [10:39] imo, it is a very ugly mess [10:40] (the schema object type also looks like a terrible hack, which also adds complexity to permissions) [10:41] actually, I think you're the first person I've talked to about PostgreSQL who has actually cared about its permission system [10:41] heh, ditto [10:42] i ran into trouble when testing some stuff at work, where permission separation was a requirement [10:42] ah [10:42] I suppose real work requires permissions and shit [10:44] eh [10:44] depends on the application :P [10:45] every PostgreSQL application I've run into so far is set up such that there's one user that owns the database, and the application connects as that user; everyone else who needs data usually does that via a service, or in special cases users with read-only permissions [10:46] yeah that's how I usually do things, tho I typically do it for personal projects and stuff [10:46] I realize that from the least-privilege perspective that's probably horrifying, but (1) it works, (2) if there's a security problem honestly it probably isn't going to come from a database credentials compromise [10:47] yup, especially when you restrict db access to localhost [10:47] (3) the applications aren't public :P [11:36] I've got a patch to download into a ram-backed temporary directory instead. I'll see how it affects performance and if it's beneficial I'll share it tomorrow. [11:36] being more careful because it (1) requires root-level access to run, and (2) needs a graceful stop of everything, I think. [11:38] actually, wait, no, I've just got my text editor set up wrong for editing shellscripts that are running.
[11:40] I've pushed it to https://github.com/chronomex/splinder-grab if anyone wants to play with it. [11:41] shouldn't require a restart BUT YOU MUST MOUNT A TMPFS ON /tmp/tmpfs BEFORE RUNNING IT [11:41] erm, that should read BEFORE PULLING IT AT ALL [11:41] # mkdir /tmp/tmpfs && mount tmpfs /tmp/tmpfs -t tmpfs [11:43] it seems to work here, fwiw. haven't tested thoroughly. [11:44] also. your tmpfs may be too small, be careful of that. [11:44] the entire purpose of this patch is to have files not hit disk unless they're the permanent copy [11:47] because creating a files/ directory full of tiny things and then deleting them soon thereafter is kind of a disk stress test [11:48] damn. bedtime. peace. [11:51] chronomex: (or anyone interested) It may also be worthwhile to set the warc-tempdir to something on that tmpfs. The warc extension uses its own temporary files. [11:52] ah, huh. not going to do that tonight, but maybe tomorrow. [11:52] Sounds sensible. [11:53] yeah, my IO throughput has dropped by scads and bounds in the past 20 minutes. [11:55] now hopefully I don't run into anything huge overnight, that'd probably totally fuck this up. [11:55] actually I'm going to stop it and resize my tmpfs tomorrow morning. [11:55] goodnight for real! [14:04] So basically the dashboard is a way for us to know how awesome Closure is. [14:04] heh [14:05] alard: btw, can you make the dashboard show everyone, not just the top 11? [14:13] We have to use this dashboard in the future. [14:13] Also appreciate the sync-finished. [14:13] I don't agree with showing "everyone". [14:14] I do think a link to the full list would make sense. [14:14] Also, it'd be fun to have a bar at the top or somewhere showing how close we are.
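A slightly fuller version of chronomex's mount command from the tmpfs discussion above. The explicit size cap and the wget hookup are suggestions drawn from the chat (the cap addresses the "your tmpfs may be too small" warning, and the `--warc-tempdir` line is alard's note), not part of the pushed patch itself; run as root:

```shell
# Cap the size so one oversized user fails fast instead of eating all RAM;
# pick a cap that fits your machine.
mkdir -p /tmp/tmpfs
mount -t tmpfs -o size=2g tmpfs /tmp/tmpfs
# Per alard's note, wget's own WARC temp files can go there too:
#   wget ... --warc-tempdir=/tmp/tmpfs ...
```

Remember the warning in the log: a single huge blog (like the 837M journal.splinder.com grab) can overflow a small tmpfs overnight.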
[14:14] well, we had some newcomers today who would have appreciated showing up in the list [14:15] They need to work harder [14:15] but sure, showing the top 10 by default and then letting the user expand it, that's cool too [14:15] Also, they show up on the scrolling list [14:15] yes [14:15] You can see your name go by [14:15] which is good [14:16] but seeing the count there is also good [14:16] It's a UI issue. [14:16] Let's say you have 100 people. [14:16] what then? [14:16] Wow, having my stuff go through a cellular connection really slows my typing up! [14:17] then we are victims of our own success [14:19] Speaking of which, I gotta clear some crap off batcave. [14:20] Poor batcave [14:23] heh [14:24] SketchCow/db48x: The dashboard/tracker code is on the ArchiveTeam github, https://github.com/ArchiveTeam/universal-tracker , in case you haven't seen. (I still have to push my recent additions, though, and it needs a bit of cleaning.) [14:24] ooh [14:24] I'll have a look at how to show a complete list. [14:24] Maybe it's fun to make it so that you can toggle between the update list and a scrollable list of every contributor? [14:25] (In the right panel?) [14:28] numberOfDownloaders: 18 - (<%= tracker["domains"].size %>) - 4, [14:28] numberOfLogLines: 18, [14:28] just tweak that [14:28] To 100? [14:28] The problem with that is that it pushes the graph too far down. [14:28] numberOfDownloaders: <%= tracker["domains"].size %>, [14:28] numberOfLogLines: <%= tracker["domains"].size %> + 4, [14:29] actually, it should be + 7 [14:29] (The 'domains' thing refers to the blogs/html/media, by the way. It comes from the four mobileme domains.) [14:29] oh [14:29] hmm [14:29] I haven't. [14:29] Good, it's good to know. [14:29] I thought the name was funny :) [14:29] db48x, I am going to guess, does not do much UI.
[14:30] the 4 is the rest of the summary lines [14:30] Seriously, you don't want this bloated list exploding the page and pushing everything down for pages and pages just to make sure every special snowflake gets his moment in the sun. [14:30] Like I said, work harder. [14:30] Yes. There are 18 lines on the right, so the left should also have 18 lines. [14:30] Or, and we can do this, make the clients get badges. [14:30] A badge/gamification system would be fun. [14:30] But for 1.0, this is excellent - it gives us a chance to see the work being done plainly. [14:31] SketchCow: the graph is already pushed down off the page [14:31] I do like the idea of a bar graph above or just below that total, so you can visually see how far we are. [14:31] so much depends on the screen size of the viewer anyway [14:31] So you want to push it WAY down, constantly updating, essentially, poop. [14:31] I like it as it is. It's smart. [14:31] I am all for a breakout window to see all the little snowflakes. [14:31] Precious, unique little snowflake. [14:34] Well, I just found out the internet goes soft on this train when it goes through a tunnel. [14:34] Like AM radio. [14:41] funny thing about radio waves [14:41] Man, seriously, this per-character typing becomes like torture. [14:41] per-character? [14:42] Yes, where it's buffering out to deal with the cell modem on the train, so you see characters at SLIGHTLY different, wavering times. [14:42] ah [14:43] About to plunk in 7.2gb of CD-ROMs [14:43] We're past the 700 CD-ROMs mark. [14:43] We actually outstrip the library of congress at this point. [14:43] individual stats page for each snowflake would be pretty cool, too [14:43] Agreed, each snowflake can be a hero [14:44] We should steal wow character images for the user photo [14:45] We probably need more people for the downloading, huh. [14:46] that's another thing I'd like to see...
estimated completion time based on current rates [14:46] but yes it looks like at this rate we would fall short of the total [14:48] --------------------------------------------------------------------------- [14:48] HEY ARCHIVETEAM MEMBERS [14:48] WE NEED YOU IN ON THIS SPLINDER DEAL [14:49] http://www.archiveteam.org/index.php?title=Splinder [14:49] HOP IN ON THAT SHIT, WE HAVE 4 DAYS [14:49] --------------------------------------------------------------------------- [14:49] I'd use the machine I have at archive.org but of course underscor is bogarting that shit [14:49] I have had spotty performance using dreamhost's servers. [14:50] heh [14:50] And by spotty I mean something comes in and rapes the processes [14:50] also, these scripts are a bit heavy; they use a lot of processes and hit the dreamhost limits after just two clients [14:50] Yeah [14:50] I don't have a metric ton of servers out on the net these days [14:51] This distributed tracker keeps track of downloads, but it doesn't know about incomplete ones, does it? And how is it tracked which profiles are ultimately safely uploaded to the batcave? [14:58] dnova: It keeps track of what has been given out, to whom and when. So eventually there'll be a list of accounts that have been claimed but have never been marked done. It doesn't track uploads to batcave. [15:00] alard: If we do that (track uploads) then we should have a different board entirely. [15:00] I don't like overloading a single status thing. [15:00] I just like re-using this thing over and over. [15:00] It's very game-y. [15:00] cdaction-63.iso cdaction-68.iso cdaction-71-2.iso cdaction-72-2.iso cdaction-76.iso cdaction-79.iso [15:00] root@teamarchive-0:/2/CDS/INBOX/MCbx6# ls [15:01] cdaction-68-2.iso cdaction-70.iso cdaction-71.iso cdaction-75.iso cdaction-78.iso cdaction-80.iso [15:01] root@teamarchive-0:/2/CDS/INBOX/MCbx6# du -sh . [15:01] 7.2G . [15:01] Now to upload. Watch how long it takes me. [15:01] SketchCow: Yes. 
Also, tracking uploads is not that useful, perhaps: for example, it takes me much more time to upload than to download. [15:01] \o/ [15:02] I'd like to see how much is actually, really, for sure saved. [15:04] Almost nothing, I expect. [15:05] right now, yeah [15:11] Damn, this mom in another seat is so annoying [15:12] It makes ME want to smoke and have premarital sex just to get out from underneath her fascist outlook [15:12] Poor ladies [15:12] Who will help them? [15:13] Also, I think it's time to really start thinking about an e-mail alert list. [15:22] Uploaded those 7.2gb of discs.. but moved on, did another 4. [15:26] a mailing list would do the job [15:34] So, I'm possibly going to have a third archiveteam related panel at SXSW. Or, I should say, one will be there. [15:35] They have this idiotic rule that if you are on one panel at SXSW you can't be on any others. [15:41] underscor: should I change the fork_more=0 under # empty? it's not the only one and I don't want to screw anything up [15:42] oh, nm, chronomex specified a little [16:36] hi guys, what's a good number of downloaders to run for splinder? [16:36] use the streamer and multiple threads [16:37] is 25 enough? [16:37] ./dld-streamer.sh # [16:37] make sure you get the latest git clone because there is an important modification [16:37] depends on your resources, dashcloud [16:37] I did it yesterday night- any changes since then? [16:37] yes I think so [16:38] I'm not sure when the patch was merged [16:39] what do I need to type after git pull ? [16:40] the url I guess [16:40] I already cloned it- I just needed to update it [16:40] yeah [16:42] I'm not really sure how to do that [16:44] "git pull" should be enough by itself, after cloning. [16:44] ...so long as you're in the directory with all the script files. [16:44] ah [16:44] oh ok thanks [16:46] dashcloud: run 25 threads, see how it goes.
You can run multiple streamers concurrently if you want to add more threads [16:47] okay [17:19] how do i stop the streamers to do a git pull [17:19] gracefully [17:22] well, the bad news is, stopping them gracefully takes forever. [17:22] but the command is touch STOP [17:22] in another terminal window [17:22] do that, do a git pull, then start more streamers but leave the gracefully-exiting ones alone [17:22] because they'll keep going but it will just take for-fucking-ever [17:22] yeah another window [17:22] same directory [17:23] "touch stop" in splinder-grab [17:23] touch STOP [17:24] oh ok the other window says stopping on request, thats cool [17:24] is 30 threads ok [17:25] if you can spare the cpu and disk IO, try more [17:25] you can start up 30, check your resources, start another 30, check your resources, etc [17:25] until you're at your maximum comfort level [17:25] thing is, i can add 100 threads and it just goes into reap mode, these scripts are really really light on resources, apparently [17:25] if it's going into reap mode you didn't do a git pull [17:26] ahh ok, thats what the commit is [17:27] I modified the script myself from some people's instructions last night...
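A minimal simulation of the STOP-file convention described above: the real dld-streamer.sh checks for a file named STOP between users and exits gracefully once it appears. The loop body here is a stand-in for the actual download work, not the real script.

```shell
rm -f STOP
run_streamer() {
  n=0
  # Keep working until a STOP file appears in the working directory:
  while [ ! -f STOP ]; do
    n=$((n + 1))                   # stand-in for "grab one more user"
    if [ "$n" -eq 3 ]; then
      touch STOP                   # normally done from another terminal
    fi
  done
  echo "stopped gracefully after $n users"
}
msg=$(run_streamer)
echo "$msg"
rm -f STOP                         # clean up so new streamers will start
```

The "takes forever" complaint in the log follows from this design: the streamer only notices STOP at a user boundary, so in-flight (possibly huge) users are finished first.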
before I learned how to do a git pull, haha [17:27] this project is literally the first time I'm using git [17:27] will git pull overwrite local changes you've made? also, me too, first time git user [17:28] no, it tells you to fuck off if you made local changes [17:28] so I had to copy a fresh script over my modified one, then do git pull [17:28] to get the "official" patched version [17:28] well, thats mean, it should ask to overwrite [17:29] which would be easier than diff-merging arbitrarily different code [17:29] brb [17:29] because there's no way to know what changes you made locally [17:30] ive been running this for 15 hours, and i have less than 600mb of users [17:35] run more threads [17:37] im running 40 at a time [17:38] well if you want more users per time you gotta run more threads. [18:06] splinder is sloooooowwwwww [18:06] plus users are smaaaaalllllll [18:12] SketchCow: It's running on batcave and abuie-dev [18:14] http://tracker.archive.org:8998/download/wcd_fuckin-your-daughter-with-a-frozen-vomit-fuck-stick_prosthetic-cunt_flac_lossless_29635610/xxxx SketchCow's favorite album [18:19] love that a lossless flac was used. it's like the intersection of insane rage and cold nerddom [18:31] hah [18:38] 7/80 PID 3723 finished 'us:foreveralone': Success. [18:38] hahahahahahaha [18:38] That brings back memories [18:46] my download.logs add up to 8983 lines [18:46] I realize a lot of them are currently downloading [18:46] I think [18:46] but um [18:46] not 1,684 of them [18:46] the leaderboard is fixed, fixed I tell you!! [18:48] is it possible that splinder is throttling the user-agent used by our scripts? [18:54] how can I check for myself how many complete users I have? [18:57] oh wow... for one of my machines, 790mb out of 2.2gb are incomplete users [18:57] that's ... really bad isn't it?
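One way to answer "how can I check for myself how many complete users I have": the streamer log lines quoted above end in "Success." for finished users and contain "Error" for failures. The sample log below is hypothetical, built from the lines seen in this channel; real logs are the download.log files next to the scripts.

```shell
# Hypothetical sample of streamer output (format taken from the log above):
cat > download.log <<'EOF'
7/80 PID 3723 finished 'us:foreveralone': Success.
12/80 PID 3801 Getting next username from tracker...
100/100 PID 7335 finished 'us:wallacemathe716': Error - exited with status 6.
EOF
# Count completed and failed users by their trailing status:
complete=$(grep -c ": Success.$" download.log)
failed=$(grep -c ": Error" download.log)
echo "complete=$complete failed=$failed"
```

With several streamers writing separate logs, the same grep can be run across all of them (`grep -c ... */download.log`) and the per-file counts summed.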
[18:58] it was 28 GiB incomplete + 78 incomplete for me [18:58] ehm 78 complete [18:58] other machine is about 1500mb incomplete, 6.2gb total [18:58] jeez :( [18:58] if you have few processes, they all get stuck with "big" users [18:59] why does it give up on them? [18:59] it doesn't [18:59] it's just sloooooooow [18:59] OH [18:59] they could still be going [18:59] I see I see [18:59] they are [18:59] look at the last modification of the warc file [19:00] (or, don't bother; they always are :-p) [19:00] * Nemo_bis has dinner at last [19:00] I'm a lot further behind than I thought though, heh [19:04] eh [19:04] we'll get there. [19:05] chronomex: I know for sure that some of my incompletes are not still being worked on [19:05] is there going to be a way for me to re-queue them? [19:05] or give them back to the tracker? [19:05] yeah I just did a patch to streamer to make that clear [19:05] unfortunately I had to run to catch the bus [19:05] it's short, I'll redo it soon [19:06] ok [19:09] also my tmpfs modification winds up with the tmpfs filling up, i'll have to resolve that [20:31] oh [20:31] the leaderboard goes by space not # of users [20:31] that's a little odd [20:35] dnova: space is more interesting [20:35] who saved the most data? [20:38] no, space is not more interesting [20:38] 10tb of lolcats vs 1tb of books [20:39] I guess I should have noticed that's how it counts, haha [20:39] oh well [20:41] Schbirid: but space is what takes time, bandwidth, and disk space. and we're not inspecting what each user has done with the space they used. [20:42] it depends on the focus [20:42] if i had the choice of mirroring only the profiles of 100000 users or the full photo collections of 100 i would choose the profiles [20:43] Coderjoe: WE aren't but in the future some historian might.
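A sketch of the "look at the last modification of the warc file" advice: a warc that is still being written to was touched recently, while an old mtime suggests a genuinely stuck thread worth re-queuing. The directory layout and profile names here are hypothetical stand-ins, and the commands assume GNU findutils/coreutils.

```shell
# Fabricate one fresh and one stale download for illustration:
mkdir -p data/us/larimar07 data/us/oldprofile
touch data/us/larimar07/larimar07.warc.gz                    # still growing
touch -d '2 hours ago' data/us/oldprofile/oldprofile.warc.gz # untouched for hours
# Profiles probably still downloading (modified within the last 30 minutes):
find data -name '*.warc.gz' -mmin -30
# Candidates for re-queuing (no writes for over 30 minutes):
stuck=$(find data -name '*.warc.gz' -mmin +30 | wc -l)
echo "possibly stuck: $stuck"
```

The 30-minute threshold is a guess; for a slow site like splinder, a large user can legitimately go much longer between noticeable size changes, which is exactly the caveat raised in the channel.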
That's the idea here [20:43] either way is fine though download download download [20:46] yeah [20:47] :D [20:47] :D [20:47] keep it on the download [20:47] lol [21:44] underscor: what's foreveralone [22:14] it was the friendster project [22:20] 100/100 PID 7335 finished 'us:wallacemathe716': Error - exited with status 6. [22:50] Maybe part of my success is that I do NOT use dld-streamer, but dozens of dld-clients [22:50] I just tried dld-streamer and it went into "reap mode" after 5 users... out of 50 [22:51] ndurner1: git pull [22:51] It' [22:51] It's fixed in the latest one [22:51] ok [22:57] Best word ever: Garret [22:57] noun: an attic, usually a small, wretched one. [23:27] ndurner how do you produce dozens of dld clients [23:50] Testing, testing, is this thing on? [23:50] after testing fonts [23:50] I've got 99 problems, and archiving ain't one [23:50] happy place ain't have giant complexes that are so boring [23:50] heh [23:50] heh [23:58] Gah, all my splinders are stuck on larger users now [23:58] closure: How many threads do you run?