[00:07] *** patricko- is now known as patrickod
[00:10] *** patrickod is now known as patricko-
[00:45] tpw_rules: git annex info
[00:46] ugh i'm running 16 at once and still having trouble pegging my internet
[00:47] *** ohhdemgir (~ohhdemgir@[redacted]) has joined #internetarchive.bak
[00:50] IA.BAK/server 0533231 Joey Hess: fix html directory for stats
[00:50] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhIx
[00:51] Well, that explains that
[00:52] indeed
[00:52] Oh shit son, we have +4's
[00:52] and I see there are now a few files replicated at +4 ..
[00:52] yes
[00:52] +4 is the current target
[00:52] hrm, or is it +3?
[00:53] anyway, might get a few over the target due to collisions
[00:53] *** londoncal has quit (Quit: Leaving...)
[00:56] So, some questions, likely some unanswerable.
[00:56] Can you, as the poobah, see how much disk space is out there among the clients?
[00:56] (I mean amount that is filling, not how much is out there and reported back. Obviously that's what that list is.)
[00:56] the data is there, it would take a little bit of calculation
[00:56] oh, you mean the unused space too? No
[00:57] What I mean, to be specific, is to go "we have X tb in total, which we are now filling."
[00:57] we could add more reporting when the clients report back, but git-annex doesn't track that anyway
[00:58] well, this project is radged! I'm in
[01:00] *** SN4T14_ (~SN4T14@[redacted]) has joined #internetarchive.bak
[01:00] *** SN4T14_ is now known as SN4T14
[01:00] SketchCow: oh, we could add a client count though
[01:01] currently 13
[01:01] Well, I want that listing out there we talked about, so I can make a map. Go ahead and do it and let me know the URL
[01:01] Have it update, oh, once a day
[01:01] Or maybe more for now, while more people are joining
[01:01] hmm, I'm uncomfortable putting up IP addresses on http
[01:01] Well, if you'd like, YOU could do the calculations, you're a genius
[01:02] I could rsync them to teamarchive1, or ..
[01:02] And then give me the rough names it shoots out
[01:02] ugh, no time
[01:02] busy genius
[01:02] tell me something to run, or I'll give you an account
[01:02] Well, how about this. e-mail me a list. I'll write my code and crap, then hand it to you.
[01:02] never looked at geodns stuff
[01:02] sure
[01:02] We're just doing extremely general, after all
[01:03] *** patricko- is now known as patrickod
[01:03] *** patrickod is now known as patricko-
[01:08] *** patricko- is now known as patrickod
[01:18] is this at a point where you want more testing clients?
[01:20] yes
[01:27] The more the better, now.
[01:27] I don't want people sacrificing anything for it, don't go for unwarranted overuse, but it's worthwhile to get us in the realm of our plans
[01:28] closure: How big is the current test shard?
[01:28] ok, how do I get started?
[01:29] Ah, I can read. 2.91 TB.
[01:30] trs80: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation#SHARD1
[01:32] yeah, 2.91 tb, but you don't need that much disk, it will use what you give it
[01:32] how often should i fsck?
[01:33] what does that check?
[01:33] still need to figure that out.. it checks the file contents
[01:33] ah
[01:33] so just md5 *
[01:34] Any problem running on an NFS mount?
[01:36] IA.BAK/pubkey d95afb3 Joey Hess: add trs80
[01:36] [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/jhni
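The +N counts and the fsck question above both map onto git-annex's copy tracking. A minimal sketch of how the target and the per-file state can be probed from a shard checkout, assuming the git-annex 5.x of this era; the specific flags here are our illustration, not quoted from the iabak scripts:

    cd IA.BAK/shard1
    git annex numcopies 4            # the "+4" target, recorded in the repo
    git annex find --not --copies 4  # files still below that target
    git annex fsck --fast            # quick presence check, no checksumming
    git annex fsck                   # full check: verifies file contents

The full fsck is what answers "so just md5 *": it re-checksums each file against the hash embedded in its key.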
[01:36] aschmitz: there can be lock file problems. I would recommend not running multiple concurrent downloads on nfs
[01:36] *** patrickod is now known as patricko-
[01:39] But with just one copy of git-annex, it should be fine?
[01:40] Second question is whether there's a way to get git-annex to use a specified keypair, rather than ~/.ssh/id_rsa.
[01:40] probably.
[01:40] aschmitz: the current iabak script generates its own dedicated ssh key, and makes it be used. so yes
[01:40] Ah, fun.
[01:43] *** patricko- is now known as patrickod
[01:44] closure: oops, so let me send you that new key
[01:44] hmm, so you don't need to manually install the latest git-annex, ./iabak does that for you (although it got an i386 version on amd64)
[01:46] IA.BAK/pubkey d2f5097 Joey Hess: swap in right key for trs80
[01:46] [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/jhWq
[01:47] yeah, hadn't realized that old documentation was still in the wiki
[01:50] hmm, now it's gone into a sleep for 1 hour
[01:50] *** patrickod is now known as patricko-
[01:52] right, because iabak-helper doesn't run git-annex init in shard1
[01:53] hmm, but it should have ...
[01:53] ah, because my git wasn't configured with user.name/email
[01:55] oh, ok
[01:56] trs80: did that leave the shard1 empty?
[01:57] sounds like the problem zottelbey had earlier
[01:57] closure: yeah, it did
[01:58] IA.BAK/master 717e95e Joey Hess: set user.name and user.email locally to deal with systems where git falls over otherwise...
[01:58] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/jh8D
[02:02] also line 107 comparing versions failed a little bit because I ran git-annex init in the root dir, causing repository version lines to be output. maybe add -m1 to the grep "version"
[02:04] little things to work around stupid users :)
[02:04] IA.BAK/master 418a7d3 Joey Hess: make version grep look at 1st line
[02:04] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/jhBB
[02:05] http://cl.ly/image/24082s0z3c1c/Screen%20Shot%202015-04-01%20at%209.04.57%20PM.png EHEHEHHEHE
[02:10] IA.BAK/pubkey 1f9922f Joey Hess: add aschmitz
[02:10] [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/jh0B
[02:11] Thanks!
[02:20] closure: fwiw I still can't ssh
[02:20] *** hatseflat (~hatseflat@[redacted]) has joined #internetarchive.bak
[02:20] (at least, the script fails with "you're not signed up yet")
[02:22] interestingly manually reconstructing the command it's using works fine
[02:22] (ssh SHARD1@iabak.archiveteam.org git-annex-shell -c configlist shard1)
[02:22] underscor: The script uses id_rsa in the local directory, while your manual command uses the one in ~/.ssh.
[02:22] aha
[02:24] Might be best to generate a new key / use the id_rsa.pub that probably got generated, rather than copying your personal id_rsa around to more places, though.
[02:24] this was a manually generated one from a previous iteration
[02:25] but yeah, good point to consider
[02:25] how long should I expect git-annex --library-path to churn before starting to download? been about 20 minutes so far
[02:26] hum
[02:26] my dirname doesn't have -z
[02:26] weird
[02:27] trs80: it can take a while on a slower disk.. you could ctrl-c, touch IA.BAK/NOSHUF and avoid the overhead of the shuffling it does
[02:30] "git annex [...] get -- [item names]" seems to just be hanging for me? There's a "[git] " a few processes after it, if that's relevant.
[02:30] 100% CPU, but no network traffic, and strace seems to mostly be it checking the time.
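On the dedicated-key question above: the log doesn't show how the iabak script wires its generated key in, but one standard way to pin git (and therefore git-annex) to a specific key is a GIT_SSH wrapper. A sketch, with the local id_rsa path as an assumption based on underscor's exchange:

    #!/bin/sh
    # ssh-wrap: use one dedicated key only, ignoring ~/.ssh and any agent keys
    exec ssh -i "$PWD/id_rsa" -o IdentitiesOnly=yes "$@"

Then something like `GIT_SSH="$PWD/ssh-wrap" git annex sync` makes every ssh connection for that repo use the dedicated key.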
[02:31] this is gonna make my disk fragmented as fuck
[02:31] aschmitz: same here
[02:32] what happens if you just try git annex get
[02:32] that's what i'm doing and it's working great
[02:33] Hmm. That'll be ordered, though.
[02:33] Which isn't ideal, but better than busy waiting.
[02:34] ordered?
[02:34] Alphabetical by item name, no?
[02:34] (The requests)
[02:34] oh yeah, i think
[02:34] IA.BAK/server 56507ef Joey Hess: sketchcow's geoip extractor
[02:34] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jh2S
[02:35] aschmitz: by default it looks at all 100 thousand files, finds ones that don't have enough copies, scrambles the list, and downloads at random. this takes a while
[02:35] did that update just recently happen?
[02:35] because mine is doing it alphabetically
[02:36] using like git annex get --not --copies 3
[02:36] Well, I think it had picked some to download, as it has a huge command line that looks like a result of that.
[02:39] hmm idk
[02:40] Hm, interesting problem.
[02:41] Looks like some of these items have since been darked.
[02:41] there are many in the shard that have been
[02:42] I just killed that process and a new one started, which is now writing stuff
[02:42] (though i'm not sure what 'darked' means in IA lingo? is it permanent?)
[02:42] IA.BAK/server 1dafefc Joey Hess: perm fixup
[02:42] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhVP
[02:42] although that git is now defunct again
[02:42] "darking" just makes items unavailable to the public, the IA keeps a copy, and could revert that if they wanted to. Doesn't usually happen, though, as far as I know.
[02:43] IA.BAK/server efda62a Joey Hess: sketch had a sort -u in there which I forgot
[02:43] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhVH
[02:47] closure: so the shuf completes, but the git-annex subprocess is dying later on
[02:51] SketchCow: here we are! http://iabak.archiveteam.org/stats/SHARD1.geolist
[02:52] trs80: hmm, if git cat-file is dying for some reason that must be the problem. I'd like to debug this, but not tonight
[02:52] it has my zip code wrong :(
[02:53] it's got my city right - probably because the box is sitting inside the ISP lol
[02:53] Yay ICBM addresses for everyone. :-/
[02:54] sep332: how do you do these things? what isp
[02:54] Must get IA in Antarctica...
[02:54] oh i cheat! i work there ;)
[02:55] oh
[02:55] but shit, that lat/lon is only 3 miles from my house
[02:56] http://iabackup.archiveteam.org/ia.bak/
[02:56] whoever is doing that may want to chop off a couple digits just in case
[02:57] if you're truly concerned about the lat-long
[02:57] We can remove it.
[02:57] SketchCow: hmm, why only 6 clients?
[02:57] we remove IP already.
[02:57] closure: Old data
[02:57] I've been hacking, bro!
[02:57] closure: is it expected that the git cat-file sits using a bunch of cpu for a while before downloading starts?
[02:57] Personally I'd stick with country and region, but I wouldn't fight over it or anything.
[02:58] i'm personally not at all. but it's a concern in the community. i'd probably round to one decimal
[02:58] underscor: yes
[02:58] closure - kill lat-long
[02:58] The code is obvious in the script
[02:58] And my thing doesn't care.
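For the chop-off-digits idea, a sketch of what sep332's one-decimal rounding could look like. The actual SHARD1.geolist format is not shown in the log, so this assumes whitespace-separated lines ending in lat and lon columns; in the end the repo dropped coordinates entirely (the de-icbm commit below):

    # hypothetical layout: lat and lon as the last two fields of each line
    awk '{ $(NF-1) = sprintf("%.1f", $(NF-1)); $NF = sprintf("%.1f", $NF); print }' \
        SHARD1.geolist.raw > SHARD1.geolist

One decimal of latitude is roughly 11 km, which is the sort of coarsening being asked for here.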
[03:01] http://iabackup.archiveteam.org/ia.bak/ now upgraded with all 15 clients
[03:01] IA.BAK/server 61a093f Joey Hess: de-icbm
[03:01] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhKz
[03:03] closure: yeah, git ls-files is what goes defunct
[03:03] oh, interesting it's ls-files
[03:04] right now it's stuck in write(1,
[03:04] kinda suggests it's due to all those files being shoved through the command line and to ls-files
[03:04] so the destination pipe is full I guess
[03:04] ls-files is stuck in write?
[03:05] yeah
[03:05] Now the fun part, more nerdy than anything.
[03:05] I want to add a second visual chart.
[03:05] I mean a second graphic chart. Now to understand how to make the api not blow up.
[03:05] to do what
[03:05] just US.
[03:06] Because it's important to know how far from IA ground zero they are.
[03:06] oh
[03:06] where is IA ground zero?
[03:06] San Francisco.
[03:06] ah
[03:06] We have one person in Walnut Creek
[03:06] also why can't you just mail tapes to maine or something
[03:06] Fuck that guy, bomb's going to get him too
[03:06] that's a different solution path
[03:06] We can build a really nice off-road car, AND work on our sailboat
[03:07] AND our drone army
[03:07] yes
[03:07] where is amazon headquarters
[03:07] Seattle
[03:07] oh
[03:07] But bear in mind, their shit is EVERYWHERE
[03:07] i was gonna say "free IA tape with every drone delivery"
[03:07] beam gps at them so they come to your facility, attach a tape, then let them go
[03:09] do you think you'll be able to accelerate backup faster than the archive adds new crap?
[03:11] How did you stumble into this project?
[03:11] My tweet?
[03:11] a tweet from @textfiles
[03:11] Holding off on USian graph - just because we had some nice advancement today, don't need to hack to 6am
[03:11] tpw_rules: SketchCow = @textfiles
[03:11] ah. so yes
[03:12] I'll add US later.
[03:13] So, tpw_rules - I could say a very long thing, or I could say "the mountain must be climbed".
[03:13] Is "the mountain must be climbed" sufficient or do you want the long thing.
[03:13] that's good enough
[03:13] i'll do my part
[03:14] the project is forcing a mass assessment of the archive
[03:14] also particularly in the US good fucking luck finding people without capped internet
[03:14] i have to pay a ridiculous amount extra for no cap
[03:14] Which was desperately needed.
[03:14] how is it protected internally? do you do tape or something in the HQ?
[03:15] Everything with little exception is on spinning disks
[03:15] are they all spinning at once?
[03:15] i assume you can recover from a failed drive for example. but what about an accidental rm -rf?
[03:16] also btw textfiles.whatever is real neat
[03:17] i have to confess to being a youngin, so i was never around for that. but it's cool to read about
[03:17] textfiles.whatever has always been proud of bringing history to the youngins
[03:18] unless you make a bomb, and then we know you're not a smart youngin and we let evolution sort that out
[03:18] i knew about it but never really immersed myself in it. i'm at least fluent in 6502 assembly language though, but not the culture
[03:18] http://iabackup.archiveteam.org/ia.bak/ now lists the countries because I got tired of counting.
[03:18] Or counts, anyway
[03:19] can you put a size over the tree view at the top?
[03:19] I did. 2.91 terabytes.
[03:19] no i mean for each box
[03:20] Not right now, no.
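The stuck-in-write diagnosis above generalizes. A sketch of the usual checks for a process that is stuck but not yet defunct, with 12345 standing in as a hypothetical pid:

    strace -p 12345 2>&1 | head    # e.g. shows it blocked in write(1, ...)
    ls -l /proc/12345/fd/1         # what stdout points at (here, a full pipe)
    cat /proc/12345/wchan; echo    # kernel wait channel, e.g. a pipe wait

A writer blocking on fd 1 while its reader never drains the pipe is exactly the deadlock ls-files hit here.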
[03:20] ie 1.7TB is not redundant at all
[03:20] use the areaaaaaa
[03:20] that's what it's forrrrr
[03:20] *** tpw_rules gets out ruler
[03:20] also textfiles.com*
[03:20] Also http://textfil.es/
[03:21] for those pesky blockers
[03:21] how is there not a .whatever gTLD at this point
[03:21] lol. i need to try it at school
[03:21] (though i always run with a vpn)
[03:24] can you guys remove a repo from the list? i deleted everything from mine because it was being funky and now i show up twice. 1d92bde5-54d3-41bc-932e-d8e8e7bfff51 is my real one and ff2f752d-b35a-4555-b8b4-617f23e4e015 is bad
[03:26] *** trs80 touches NOSHUF and starts again
[03:26] ahh, sweet downloads
[03:29] trs80: I reproduced the problem.. so I'll probably be able to fix it
[03:29] closure: ah, cool. was going to say I'm in UTC+8 if you wanted to look at it another time
[03:29] it seems it prints out an enormous list of directory names before stalling?
[03:30] okay it's sleepy time for me. closure can you delete that extra repo?
[03:30] tpw_rules: we could, but let's not worry about it. We want to automatically detect dead repos and disregard them
[03:30] ok. i just noticed it with annex info
[03:31] but goodnight. got 500gb so far
[03:31] closure: yeah, that sounds like what's happening
[03:32] that is seriously weird. it's like it thinks that's all one file
[03:32] I think I'll just make it run git-annex once per dir for now, and debug this tomorrow
[03:34] IA.BAK/master b3e2de7 Joey Hess: temporary workaround for strange hang-of-doom when git-annex is given a really, really big list of dirs to get
[03:34] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/jhDX
[03:43] closure: the workaround wfm, although I've still got a defunct git process (not sure what type)
[03:43] *** SO, if iabak is stuck not doing anything and eating cpu, now's a good time to restart it ***
[03:43] cat-file and wget are fine though
[03:43] remote: error: hook declined to update refs/heads/synced/master
[03:43] :(
[03:43] underscor: intentional, you're not supposed to be changing the master branch
[03:44] oh
[03:44] did I do something wrong?
[03:44] haha
[03:44] how do I check?
[03:44] git log master
[03:45] aha
[03:45] wonder how that git commit happened, weird
[03:45] and then you'll want to git reset --hard HEAD^ or so :)
[03:45] but, please carry on trying to break it ;)
[03:45] just don't break it by doing horrible commits in the git-annex branch, that is not checked yet at all
[03:46] ok
[03:46] OH LOOK WHAT BROUGHT BACK UNDERSCOR
[03:46] :D
[03:47] wow
[03:48] man my shard is still really broken
[03:49] closure: I reverted the commit and did the reset --hard
[03:49] but it's still trying to commit master on annex sync
[03:49] underscor: you probably have a synced/master branch lying around with the bad commit in it, which you'd need to delete
[03:50] closure: is delete different than revert in this context?
[03:50] oh, but commit master .. idk, why it would have something to commit
[03:51] http://p.defau.lt/?rj4PyeY9VmzbB5LnHYw2qQ
[03:51] yeah, git branch --delete synced/master
[03:52] closure: and now, http://p.defau.lt/?0gvXoSd272FIja7elUcgBQ
[03:53] you need to delete synced/master and reset master, both
[03:54] yay!
[04:00] trs80: ok, figured it out. It's simply an exponential blowup due to some fancy stuff it tries to do with the command line. Plus possibly a little bit of truncation
[04:01] Improvement Continues!
[04:02] Eventually, I will turn the graph page into an ad to help with the experiment.
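Collecting the synced/master recovery closure talked underscor through above into one place (branch names are from the log; -D rather than --delete, since the bad commit is unmerged anywhere else):

    cd IA.BAK/shard1
    git branch -D synced/master    # drop the synced branch carrying the bad commit
    git reset --hard HEAD^         # move master back to before the accidental commit
    git annex sync                 # resync; the server hook should now accept it

Both steps are needed, as closure notes: sync recreates synced/master from master, so either one alone keeps reintroducing the bad commit.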
[04:12] *** espes__ (~espes@[redacted]) has joined #internetarchive.bak
[04:17] sweet, sped that up by like 1000x
[04:36] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[04:40] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhN2
[04:40] IA.BAK/server 756666f Joey Hess: grep the compressed auth.log too, to get a full month of IPs
[05:04] *** zottelbey has quit (Remote host closed the connection)
[06:14] IA.BAK/server 930b511 Joey Hess: gc repo too
[06:14] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jjY8
[06:16] IA.BAK/master 30f5611 Joey Hess: remove debug output
[06:16] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/jjYF
[07:05] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
[07:10] right, 10 iabaks running, should be done in just over a day
[07:11] *** bzc6p has quit (Ping timeout: 600 seconds)
[07:52] IA.BAK/pubkey 0e91e20 Daniel Brooks: another for me
[07:52] [IA.BAK] db48x pushed 1 new commit to pubkey: http://git.io/jjwL
[08:11] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
[08:48] *** londoncal has quit (Quit: Leaving...)
[08:52] *** bzc6p_ has quit (Read error: Connection reset by peer)
[08:53] *** bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
[09:12] *** edsu (~edsu@[redacted]) has joined #internetarchive.bak
[09:23] hm? you can start multiple jobs on 1 box?
[09:27] yea, git annex commands very carefully avoid stepping on each other's toes
[09:28] you can run 'git annex get' as many times in parallel as you want
[09:28] easiest way to do that is to run iabak multiple times
[09:29] and then while they're running you can run git annex get manually to pull down a specific item that you're interested in
[09:37] midas: Except for network filesystems.
[09:39] good point :p
[11:34] db48x: i think gnu parallel (http://www.gnu.org/software/parallel/ ) would be nice to build into the helper-script (as some kind of option)
[11:57] [IA.BAK] zottelbeyer opened pull request #9: correct zottelbeyer's pubkey (pubkey...patch-1) http://git.io/veebN
[11:59] IA.BAK/pubkey 09a303b Daniel Brooks: Merge pull request #9 from zottelbeyer/patch-1...
[11:59] IA.BAK/pubkey 6bb10d9 zottelbeyer: correct zottelbeyer's pubkey...
[11:59] [IA.BAK] db48x closed pull request #9: correct zottelbeyer's pubkey (pubkey...patch-1) http://git.io/veebN
[11:59] [IA.BAK] db48x pushed 2 new commits to pubkey: http://git.io/veeN1
[12:00] hater: possibly. it might be easier to build support for concurrent downloads into git annex itself
[12:16] i am too lazy to learn haskell to program a tool which already exists
[12:17] https://git-annex.branchable.com/todo/parallel_get/ <-- Posted 3 months and 14 days ago
[14:43] i switched over to using the iabak script and i'm getting an error
[14:43] error: Untracked working tree file 'internetarchivebooks/100storyofpatrio00sinc/100storyofpatrio00sinc_archive.torrent' would be overwritten by merge.
[14:43] can I just delete the file and try again?
[14:53] that's weird.. is your repository in direct mode maybe?
[14:53] git annex info --fast
[14:54] 1st line
[14:54] i just did a "git clone" and then copied the files to the shard1 folder
[14:54] "indirect"
[14:54] oh, hm, so you switched over by copying files?
[14:55] yeah
[14:55] I hope you copied .git/annex, that's where the actual downloads are
[14:55] but really, the right way is to just move your old git repo to IA.BAK/shard1
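As for hater's GNU parallel suggestion above, a sketch of how it might look against a shard checkout; the -j4 job count and the --copies cutoff are illustrative, and, per midas and aschmitz earlier, this is best avoided on network filesystems:

    cd IA.BAK/shard1
    # list files still under 3 copies, NUL-separated, and fetch 4 at a time
    git annex find --not --copies 3 --print0 | parallel -0 -j4 git annex get --

This leans on closure's point that concurrent 'git annex get' runs in one repo are safe; interleaved progress output, one of the objections raised later in the log, is the main cost.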
[14:55] ok. i didn't realize about .git
[14:56] suggest you move the files back to the old repo, delete the new repo, and move the old repo
[15:04] ok, it's working fine. thanks closure
[15:14] wow, we're over 50% on SHARD1
[15:14] er, no. Over 25% :)
[15:14] well... i have 1.9TB on this drive now
[15:15] shouldn't that be higher then?
[15:15] maybe.. could be your client has not communicated back, if you just started running the script
[15:15] did you ever git annex sync manually before?
[15:15] script does it once an hour
[15:16] yeah, i stole that snippet that runs sync every hour
[15:16] and i ran it manually twice in the last hour
[15:16] well, at 3 copies, SHARD1 needs 9 tb
[15:17] Right.
[15:18] of course, the graph is counting by files not by size anyway. So somewhat comparing apples and oranges
[15:18] Shhh
[15:18] Don't wreck my dreams
[15:19] I agree, size is ideal.
[15:19] But I like incrementing, after all
[15:20] http://blog.dshr.org/2015/03/the-opposite-of-lockss.html
[15:20] (My comment at end)
[15:24] closure: If you create output files of data updated regularly about the activity, I'll make pretty graphs that display them.
[15:25] SketchCow: how about a connecting clients per hour graph?
[15:32] I'm for any textfiles you want to generate.
[15:32] I'm not doing particularly smart graphing, so I'm converting files into graphs
[16:13] warning: the iabak-helper script is broken atm: someone changed the output of 'git-annex version' - i pulled a bugfix but it is not merged into master atm
[16:21] here is the bugfix: https://github.com/cancerAlot/IA.BAK/commit/6c432e4808ebad9cbcb33902535a575d6b687f0e
[16:55] closure: How hard is it for me to give you a collection and have you go "aaaaand here's the stats on that."
[16:56] i.e. how big it is (originals and system files, number of items)
[17:01] *** svchfoo1 has quit (Quit: Closing)
[17:06] *** svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
[17:09] *** svchfoo2 gives channel operator status to svchfoo1
[17:14] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[17:20] alright, it's working now! though the speed is somewhat terrible.
[17:22] db48x: 'one tool for one thing' - does implementing parallel-support into git-annex violate that 'rule'?
[18:58] *** patricko- is now known as patrickod
[19:03] [IA.BAK] db48x created git-annex from synced/git-annex (+0 new commits): http://git.io/veU3j
[19:03] [IA.BAK] db48x created synced/git-annex from git-annex (+0 new commits): http://git.io/veU3p
[19:03] [IA.BAK] db48x created synced/master from master (+0 new commits): http://git.io/veU3h
[19:04] IA.BAK/server 2bad8e6 Joey Hess: add a client connections per hour data file
[19:04] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veUsE
[19:09] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veUZH
[19:09] IA.BAK/server 14ffc71 Joey Hess: typo
[19:11] hater: current version code has: git-annex version | grep "version" -m 1
[19:11] which seems to work ok...
[19:13] *** closure goes and adds a git annex version --raw anyway
[19:14] IA.BAK/server 0ac68d4 Joey Hess: typo2
[19:14] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veUCi
[19:17] SketchCow: I can ingest a collection into a new shard pretty quickly, and then can do anything we can do with SHARD1
[19:18] by pretty quickly, 10 minutes or so
[19:18] Which is great.
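For reference, the version-parsing fragility being discussed above, roughly reconstructed (the exact iabak-helper line isn't quoted in the log):

    # fragile: a stray git-annex init adds extra "version" lines (repository
    # version etc.), so the grep is pinned to the first match:
    git-annex version | grep "version" -m 1
    # the option closure added upstream, which sidesteps parsing entirely:
    git-annex version --raw

With --raw the command prints only the bare version string, so changes to the labelled output can no longer break the script.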
[19:18] Mostly, I just wanted the ability for you to look at a collection and go "it's this big"
[19:19] I've only done that for the number of files, which is all I care about, not disk size. The data is available in the census though
[19:20] Anyway, I have our next collection, I think.
[19:20] usfederalcourts
[19:21] and genealogy
[19:21] But obviously I think we should be at a SOLID 4 for the current shard before we add more shards.
[19:21] And we have some shard-punching to do, etc.
[19:23] are we going to 3 or to 4?
[19:23] and by 4 I mean, 4 including IA
[19:24] it's a big decision
[19:24] aw, what's 14PB between friends
[19:25] ... times 1770
[19:25] er, you already multiplied, didn't you
[19:25] numbers too big
[19:25] SketchCow: so here is a new textfile for you.. http://iabak.archiveteam.org/stats/SHARD1.clientconnsperhour
[19:26] that is the number of clients that connected for that shard, per hour.
[19:26] client connections
[19:26] Since people seem to be uber-connecting
[19:26] the guys that are running concurrent iabak scripts count multiple times
[19:26] it would take 3 days for my computer to count that high
[19:26] call it "worker threads" or something
[19:27] it would make a nice bar graph
[19:27] closure: http://iabackup.archiveteam.org/ia.bak/
[19:27] I'm assuming we're working to make it 100% Dark Green
[19:27] that's how it's set right now, yes
[19:27] (the area graph. Making the map 100% dark green will take longer, muhahaha)
[19:28] 13124639 usfederalcourts.list
[19:28] closure: Well, that's what I'm shooting for.
[19:28] Yes, usfederalcourts will be 1.3tb
[19:28] 13 million.. so, that's 130 shards. They may be smaller than usual disk size, I dunno
[19:28] that's file count
[19:28] Ah.
[19:29] Well, anyway, point is that I always assumed "4" (3+IA). Everything green
[19:29] oh, ok. I pulled COPIES=4 in iabak from /dev/ass
[19:30] My documentation and writing mentions it
[19:30] I bet you got it there
[19:30] These "sectors" are then checked into the virtual drive, and based on whether there's zero, one, two or more than two copies of the item in "The Drive", a color-coding is assigned. (Red, Yellow, Green).
[19:30] *** bzc6p has quit (Read error: Operation timed out)
[19:31] I just stole that idea from Josh S., creator of Delicious and most bitter Google Employee ever
[19:31] Who told me GMail works on "5 copies of mail, in 3 discrete geographical locations, at all times"
[19:32] 5 is my bare minimum replication for important personal data. and yeah, 3 locations
[19:32] See? So we both agree
[19:32] IA + 3
[19:32] (IA is two)
[19:32] (Sort of)
[19:32] (Let's pretend I said it was)
[19:32] *** bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
[19:32] I mean, it's definitely two copies, but occasionally two copies end up in the same building.
[19:33] 89 genealogy.list
[19:33] heh! well, we can fit that in somewhere
[19:34] I wonder if that's really all of it. There is a weirdness in the census where an item can be in multiple collections, and the data I'm working from just picked the first one
[19:34] I think it's not.
[19:34] It's huge.
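That first-collection weirdness is easy to picture. A sketch of the kind of tally involved, assuming (hypothetically) a census dump with one JSON object per item and a "collection" array like the one quoted further down:

    # count items by the *first* collection listed, the way the working data did
    jq -r '.collection[0]' census.json | sort | uniq -c | sort -rn

An item filed as ["1880_census","microfilm",...,"genealogy",...] is counted under 1880_census only, which is why genealogy can show a mere 89 items while plainly being huge.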
[19:35] oh, I ran git annex sync in IA.BAK, not IA.BAK/shard1
[19:35] that's confusing
[19:36] hater: yes and no
[19:41] it'd be nice if we could always just use parallel (or any of a dozen alternatives), but there are a couple of problems with it
[19:42] interleaving the output of a bunch of git annex get commands is super annoying
[19:43] the number of jobs to run simultaneously is not obvious; what we really care about is how much bandwidth we're using
[19:43] some people want to use a lot, some people want to throttle it, some people want to do both at different times, or in different circumstances
[19:43] yeah, a get that started/stalled to saturate would be great
[19:44] this works a lot better with 10 gets
[19:45] *** patrickod is now known as patricko-
[19:46] "collection":["1880_census","microfilm","americana","us_census","genealogy","additional_collections"]
[19:47] some have a cap and don't care about throughput, but only the total data uploaded/downloaded
[19:47] seeing a lot of that kind of thing.. that's presumably why genealogy has so few items, they all went to other more specific things
[19:48] 35518 1880_census
[19:48] *** SN4T14_ (~SN4T14@[redacted]) has joined #internetarchive.bak
[19:49] 57995 jstor_virglawregi
[19:51] *** closure wonders if we can go to 200 thousand or so per shard. Have not noticed many scaling issues with 100k files. Except for that startup delay for shuffling..
[19:54] 103554 nasa_techdocs .. that would be a nice shard
[19:54] closure: this is a side issue, but I just noticed that every single iabak-helper I've ever run is still waiting around to do a sync every hour
[19:55] because they bg?
[19:55] yep
[19:55] perhaps it should fork off a single syncer if none is running
[19:56] let's see, what lock file program is portably available...?
[19:56] I'm thinking maybe perl
[19:57] *** SN4T14 has quit (Ping timeout: 512 seconds)
[19:57] doesn't the assistant already do that?
[19:57] you can try the ftp boneyard, it's big and has huge files
[19:59] it does.. does some other stuff we maybe don't want
[19:59] oh, util-linux has flock(1) now
[20:05] IA.BAK/master 2c3a13d Joey Hess: use separate program for hourly background sync, and use lock file so only 1 runs
[20:05] IA.BAK/server c061607 Joey Hess: this script seems to have bit rotted since I last ran it
[20:05] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veUDM
[20:05] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veUD1
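A sketch of the flock(1) arrangement that commit describes, so only one background syncer runs per repo no matter how many iabak-helpers get started; the lock file name here is illustrative, not the one the repo uses:

    cd IA.BAK/shard1
    (
        flock -n 9 || exit 0      # another syncer already holds the lock: bail
        while true; do
            git annex sync        # propagate location tracking, hourly
            sleep 3600
        done
    ) 9>.git/iabak-sync.lock &

The fd-9 redirect keeps the lock held for the life of the subshell, and -n makes duplicate starters exit immediately instead of queueing.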
[20:10] here's a thought. SHARD2 could be made by taking the *smallest* collections, until we get to 100k files.
[20:11] that turns out to be 3537 collections.
[20:11] with the larger collections in it being ones like TheIncredibleSandwich, TheFivePercent, KingsInDisguise, HOLLER_band
[20:11] public_library_of_science, usda-agriculturalhistoryseries
[20:14] *** patricko- is now known as patrickod
[20:22] *** patrickod is now known as patricko-
[20:24] closure: "grep "version" -m 1" - that "-m 1" part was not in the source code when i wrote the patch
[20:39] hater: so, it's ok now?
[20:41] yes
[20:48] http://iabak.archiveteam.org/candidateshards/
[20:48] so, that's some lists of collections, starting with the ones with the fewest files. Most of the lists are 100k files
[20:49] around 100-150 there are some interesting sets of collections
[20:50] http://iabak.archiveteam.org/candidateshards/smallestfirst118.lst I like this one
[20:50] http://iabak.archiveteam.org/candidateshards/smallestfirst118.lst
[20:50] oop
[20:50] has: NISTJournalofResearch, 1880_census, speedydeletionwiki
[20:51] http://iabak.archiveteam.org/candidateshards/smallestfirst107.lst archiveteam + glennbeck + some jstor
[20:54] http://iabak.archiveteam.org/candidateshards/smallestfirst10.lst nice grab bag
[20:55] IA.BAK/server b39791a Joey Hess: add simple shard packer...
[20:55] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veTLk
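A sketch of the smallest-first packing the shard packer commit refers to, assuming an input of "<filecount> <collection>" lines like the counts pasted earlier (the real packer's input format isn't shown in the log):

    # take collections smallest-first until the shard holds ~100k files
    sort -n collections.counts |
    awk '{ total += $1; print $2; if (total >= 100000) exit }' > SHARD2.collections

Packing small collections together is what yields lists of thousands of collections per 100k-file shard, with single large collections like nasa_techdocs filling a shard on their own.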
[21:15] [IA.BAK] cancerAlot closed pull request #8: install the right arch of git-annex (master...master) http://git.io/jpBr
[21:16] SketchCow: another line for your graph.. http://iabak.archiveteam.org/stats/SHARD1.filestransferred
[21:16] this will get 1 line added per hour, with the timestamp, and the total number of files transferred so far.
[21:17] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veTnP
[21:17] IA.BAK/server f5f8a53 Joey Hess: add filestransferred data file
[21:17] [IA.BAK] cancerAlot opened pull request #10: checks if "reserve" is less than the available space (master...master) http://git.io/veTnH
[21:18] closure: https://github.com/ArchiveTeam/IA.BAK/pull/10 closes the issue: https://github.com/ArchiveTeam/IA.BAK/issues/7
[21:22] IA.BAK/master 13e1b64 cancerAlot: Merge pull request #1 from ArchiveTeam/master...
[21:22] IA.BAK/master 6c432e4 cancerAlot: bugfix because the 'git-annex version' output was changed
[21:22] IA.BAK/master 7e2cf45 cancerAlot: .
[21:22] [IA.BAK] joeyh closed pull request #10: checks if "reserve" is less than the available space (master...master) http://git.io/veTnH
[21:22] [IA.BAK] joeyh pushed 5 new commits to master: http://git.io/veTCQ
[21:38] *** svchfoo2 has quit (Ping timeout: 240 seconds)
[21:39] *** svchfoo2 (~chfoo2@[redacted]) has joined #internetarchive.bak
[21:39] *** svchfoo1 gives channel operator status to svchfoo2
[21:53] *** patricko- is now known as patrickod
[21:55] *** patrickod is now known as patricko-
[22:12] who is able to add my ssh public key?
[22:14] could you implement a thread option in the script? i now have to run 4 ttys to reach 1.2-2MiB/s. i would like to max out my 7MB/s without opening another 15.
[22:15] zottelbey: parallel downloading is in progress
[22:15] neat.
[22:15] https://git-annex.branchable.com/todo/parallel_get/
[22:15] hater: also, for the pub key: anyone with write access to the git repo.
[22:16] i don't care about the output tbh, i just want it to be faster.
[22:17] iabak could just run n copies of git-annex.
[22:19] "Last edited 3 months and 15 days ago" tells me git-annex is probably not going to get there any time soon.
[22:24] the author of git-annex is present and has been making changes to better suit it for this project
[22:24] no need to be an ass
[22:25] hater: I can add your key
[22:25] yipdw, sorry, didn't mean to offend anyone.
[22:25] np, I probably read too far into it
[22:25] probably.
[22:26] db48x: i sent the link in the query
[22:26] hm, is it possible to make the script store the "backup data" in a different location than the iabak scripts etc? Having issues, the script seems to create symlinks, which doesn't play well with smb/cifs shares
[22:27] IA.BAK/pubkey 9886116 Daniel Brooks: add hater's public key
[22:27] [IA.BAK] db48x pushed 1 new commit to pubkey: http://git.io/veTMP
[22:27] db48x: thx
[22:27] yw
[22:29] Kazzy: it's currently not an option, but you can edit iabak-helper to change the location
[22:30] Kazzy: you could also go into the shard1 directory and run 'git annex direct' to switch to direct mode, which doesn't use symlinks
[22:32] hm, i'll take a shot at changing paths in the iabak-helper script, see if it'll work that way, cheers
[22:33] Kazzy: if something useful comes out, push it into the repo ;)
[22:34] well, at first it'll just be changing the hardcoded paths, I guess, will see where it goes from there
[22:47] [IA.BAK] kurtmclester opened pull request #11: Changed key. -Kazzy (pubkey...pubkey) http://git.io/veTHj
[22:50] IA.BAK/pubkey 6a6a11d Kurt: Changed key. -Kazzy
[22:50] IA.BAK/pubkey ccb9ea4 Daniel Brooks: Merge pull request #11 from kurtmclester/pubkey...
[22:50] [IA.BAK] db48x closed pull request #11: Changed key. -Kazzy (pubkey...pubkey) http://git.io/veTHj
[22:50] [IA.BAK] db48x pushed 2 new commits to pubkey: http://git.io/veTQK
[22:52] git-annex-shell: user error (git ["config","--null","--list"] exited 126)
[22:53] can you show the output from prior to that?
[22:53] Checking ssh to server...
[22:53] only bit before that is: Hit Enter once you're signed up!
[22:53] then it throws that error, and asks me to sign up for access again
[22:54] Kazzy: my guess is you've mangled the path to the git repo on the server
[22:55] git remote add origin "$user:$dir"
[22:55] since that uses $dir, if you changed it, it'll look in the wrong place on the server
[22:56] oh right, hm, yeah didn't notice all the $dir references in there, will try some more poking
[22:59] all those dir variables should probably stay as-is
[23:04] just change to a different directory before the if on line 126
[23:05] oho, stats have been broken today!
[23:06] seems I have a stupid permissions error
[23:09] omg
[23:10] +1 and +2 have *both* overtaken +0 in the stats!
[23:10] numcopies +0: 17275
[23:10] numcopies +1: 42961
[23:10] numcopies +2: 32420
[23:10] numcopies +3: 10149
[23:10] numcopies +4: 490
[23:10] *** closure wonders if SketchCow's script will handle this, I forgot it sorted it like that
[23:11] at 2 am yesterday, we had numcopies +0: 54519 ..
[23:14] we've doubled the total files transferred today
[23:20] Kazzy: https://github.com/db48x/IA.BAK/commit/a320bbbf0abd1359c0b20fbe7f412864437fa357
[23:21] oh nice, thanks for that one
[23:21] will take a shot at that now
[23:23] i love this channel; someone asks for a feature and it is available in less than an hour
[23:30] *** zottelbey has quit (Remote host closed the connection)
[23:31] http://teamarchive1.fnf.archive.org/ia.bak/graph.html \o/
[23:35] *** patricko- is now known as patrickod
[23:46] *** patrickod is now known as patricko-
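For reference, the direct-mode switch suggested above for Kazzy's smb/cifs symlink problem, together with the quick mode check used earlier in the log:

    cd IA.BAK/shard1
    git annex info --fast | head -1    # first line reports direct vs. indirect
    git annex direct                   # keep content as plain files, no symlinks

Direct mode trades away some of git-annex's safety guarantees, but it keeps the work tree free of symlinks, which cifs mounts of this era could not represent.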