[00:13] closure: very cool :) [01:01] *** patricko- is now known as patrickod [01:28] git-annex seems to ignore annex-ssh-options when running git annex find? [01:29] (I have no idea how git-annex, works.. i assume that tries to connect to isbak.archiveteam.org when it runs.) [01:29] git annex find only needs to look at the local metadata [01:34] oops, debug messages i threw in threw me.. git annex sync doesn't play nicely with ssh-options [01:34] https://github.com/db48x/IA.BAK/blob/a320bbbf0abd1359c0b20fbe7f412864437fa357/iabak-helper#L21 [01:36] Kazzy: well, git annex sync didn't used to support the ssh-options, but that was improved in february and afaik it does now [01:37] hm, this was downloaded today from https://downloads.kitenet.net/git-annex/linux... etc [01:37] is there a newer version out there somewhere? [01:37] *** patrickod is now known as patricko- [01:37] if it wasn't working, we'd not be getting stats updates from all the clients [01:37] Kazzy: are you running git-annex in your regular PATH, or the one that was downloaded? [01:38] *** patricko- is now known as patrickod [01:38] from the downloaded copy [01:42] closure: flock: invalid option -- 'E' [01:42] flock (util-linux 2.20.1) [01:42] oh, ok [01:42] although that's the wheezy version, let me go to jessie [01:42] that makes it not exit 1 when the lock is busy [01:44] yeah, flock from util-linux 2.25.2 has -E [01:44] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/vek2l [01:44] IA.BAK/server 926dc43 Joey Hess: comment typo [01:51] trs80: should be fixed [01:56] closure: what version of git-annex are you currently using? [01:57] 5.20150219 or newer will work [02:02] totally weird behaviour, really.. 
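The flock -E behavior discussed above can be sketched as follows. -E (--conflict-exit-code) appeared in util-linux 2.25, so wheezy's flock 2.20.1 quoted above rejects it; the lock file path here is a throwaway placeholder.

```shell
# Sketch of flock's -E option (util-linux >= 2.25; wheezy's 2.20.1
# lacks it). -E picks the exit status used when -n finds the lock
# already held, so callers can tell "busy" apart from other failures.
LOCK=$(mktemp)
flock -n "$LOCK" sleep 3 &     # hold the lock in the background
holder=$!
sleep 1
flock -n -E 99 "$LOCK" true    # contended: exits 99 instead of 1
rc=$?
echo "lock-busy exit status: $rc"
wait "$holder"
```

With plain `flock -n`, a busy lock exits 1, which is indistinguishable from other errors; -E makes "lock busy, skip this run" easy to detect in a wrapper script.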
can auth fine using the key itself, but when the script calls it, we don't play ball [02:04] if you have a ~/.config/git-annex/program that points at some other, old version you have installed, that could possibly explain it [02:05] hmm, no, I made a change recently that prevents that being a problem [02:06] nope, ~/.config/git-annex/ doesn't exist at all, i changed the script to use the copy saved in a directory i downloaded it to earlier, from the link in the script [02:06] git-annex version: 5.20150327-g0ae1f8c [02:07] my guess comes down to "I changed the script" ... [02:08] running 'git annex sync' in the shard1 directory asks for the password, even outside the script [02:09] the only real difference is the location that remote.origin.annex-ssh-options points to for the key, it's an absolute path [02:22] *** Disconnected (Connection reset by peer). [02:23] *** Now talking on #internetarchive.bak [02:23] *** Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam [02:23] *** Topic for #internetarchive.bak set by chfoo!~chris@[redacted] at Wed Mar 4 18:38:46 2015 [02:25] *** chfoo has quit (Read error: Operation timed out) [02:27] *** You are now known as chfoo [03:00] closure: does the "IA Only" include dark items? [03:01] (ie, we'll never get rid of all of the red on that graph?)
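For reference, the remote.origin.annex-ssh-options setting mentioned above can be set like this; the demo repo, remote URL, and key path are placeholders, not the project's real values.

```shell
# Sketch only: the repo, remote URL, and key path are placeholders.
# annex-ssh-options is the per-remote git-annex setting discussed above;
# BatchMode=yes makes ssh fail cleanly instead of prompting for a password.
cd "$(mktemp -d)"
git init -q .
git remote add origin ssh://SHARD1@iabak.archiveteam.org/~/shard1
git config remote.origin.annex-ssh-options \
    "-i $HOME/IA.BAK/id_rsa -o BatchMode=yes"
```

An absolute key path (as noted at 02:09) avoids any ambiguity about what directory git-annex runs ssh from.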
[03:01] no, dark items were excluded from the census, and therefore weren't added to the shard [03:02] There have since been a lot though [03:02] I get a lot of 403s, at least [03:03] http://archive.org/history/housegarden140julnewy [03:03] for example [03:03] isn't dark, but is in printdisabled collection [03:03] which makes it "private" but not "dark" [03:05] hmm, 12 days [03:05] I think that change is newer than the census [03:06] yes, the census was taken 3-4 weeks ago: https://archive.org/details/ia-bak-census_20150304 [03:08] so I guess the answer is that there will eventually be some small number of IA Only items left [03:13] we can remove them from the repo if we want to [03:18] IA.BAK/server 94105b3 Joey Hess: run gc as SHARD user, not root, which was messing up perms and stats [03:18] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/vekFJ [03:22] *** ENDING LOGGING AT Thu Apr 2 22:22:26 2015 [03:23] *** BEGIN LOGGING AT Thu Apr 2 22:23:08 2015 [03:37] IA.BAK/server 3bc8668 Joey Hess: cache geoip lookups [03:37] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/vekAO [03:39] only 1193 files remain that have not been downloaded at least once [03:39] those may all be dark at this point? [03:45] underscor: I didn't know about printdisabled collection.. [03:47] I guess it's unlikely items in it will change.. tending toward deleting those from the repo [03:52] will it finish the git-annex if it can't get some because they have been darked? [03:53] it will finish, just will say some files failed to download [03:53] usenet-alt/alt.sex.pictures.mbox.zip [03:53] hah that's one file we've not gotten yet [03:53] that is not dark [03:53] *** closure watches for that file suddenly jump to 10 copies [03:58] i wonder what my disk bottleneck is like running 16 processes at once [04:11] Hrm, y'know, I've been idling here long enough. [04:12] I should actually go ahead and get a clone of the annex going. 
:) [04:13] *** pikhq would like to request being on the permitted hosts to access the repo [04:13] pikhq: I'll add you, paste me the key [04:14] Thanks. [04:15] IA.BAK/pubkey 6918741 Joey Hess: add pikhq [04:15] [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/veIJa [04:16] that red patch is getting delightfully small [04:21] Hooray, "downloading from the Internet Archive". [04:38] IA.BAK/server 546181a Joey Hess: typo [04:38] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIkC [04:43] IA.BAK/server 74438bc Joey Hess: fix su call params [04:43] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIIT [04:54] IA.BAK/server 0a5354e Joey Hess: support for more shards [04:54] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veILH [05:01] I've been thinking about the stats page [05:01] I think we need time-series data, to show the rate of progress [05:03] we could use Graphite, and feed the existing stats into it whenever they're updated [05:13] a simpler improvement would be to add the item count to the tooltips in the treemap: [05:13] generateTooltip: function (row) { return "Of all the items in this shard, "+ data.getValue(row, 2) +" have "+ data.getValue(row, 3) +" known copies."; } [05:20] db48x: I have some data files that I think you could use [05:20] http://iabak.archiveteam.org/stats/SHARD1.filestransferred [05:20] http://iabak.archiveteam.org/stats/SHARD1.clientconnsperhour [05:21] please come up with a visualization [05:30] I can't do a visualization, but I can do a chart [05:37] I think he wants me to do it. [05:37] Well, so, my stats page stuff is good for one-offs [05:38] Also, I wanted someone else to do visualization, but nobody was stepping forward, so I started it. [05:39] Also, looks like we're just on the edge of having zero IA-only items in the shard. 
[05:40] IA.BAK/pubkey 92f416f Joey Hess: copy SHARD1 pubkeys to SHARD2 [05:40] [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/veIGg [05:41] SketchCow: that generateTooltip line above could be added to the options that you pass to tree.draw [05:46] closure: assuming we had a graphite server, something like this would push the current shardstats into it: [05:46] echo "something.something.numcopies.${mult}.${SHARD} ${count} ${DATE}" | nc ${GRAPHITE_HOST} ${GRAPHITE_PORT} [05:46] while read mult count; do [05:46] done < grep 'numcopies +' "/var/www/html/stats/$SHARD"; [05:46] Apparently I picked a halfway-decent time to start grabbing stuff: when there was a point in fetching new things. :) [05:47] should be modified slightly to feed it all of the stats in a single TCP connection [05:47] and we've got to name the stats fairly well, and pick storage policies for them, and so on [05:48] but then Graphite puts them into a time-series database (Whisper), and makes querying that via HTTP really easy [05:49] IA.BAK/server 4e1486f Joey Hess: typo [05:49] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veInB [05:50] IA.BAK/server 3a1b03a Joey Hess: typo [05:50] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veInD [05:52] IA.BAK/server e04331c Joey Hess: fix cd [05:52] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIcv [05:53] IA.BAK/server d3bab05 Joey Hess: remove dirname complication, not needed [05:53] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIc8 [05:54] IA.BAK/server 41a6f38 Joey Hess: typo [05:54] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIcu [06:01] We're going to break the zero-IA barrier [06:06] IA.BAK/master 229385f Joey Hess: initial support for multiple shards [06:06] IA.BAK/master e819d37 Joey Hess: typo [06:06] IA.BAK/master e9adea4 Joey Hess: fix set -- lines [06:06] IA.BAK/server 25d0680 Joey Hess: few fixed to shard user setup [06:06] [IA.BAK] joeyh pushed 1 new commit 
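The 05:46 shell fragment above was pasted as separate IRC lines and is jumbled as logged; a cleaned-up sketch might look like the following. The metric naming, the assumed stats-file line format ("numcopies +N COUNT"), and Graphite's plaintext port 2003 are all assumptions, not project decisions.

```shell
# Hedged reconstruction of the stats-to-Graphite idea from the channel.
# Assumptions: stats lines look like "numcopies +N COUNT"; the metric
# prefix "iabak." and port 2003 are placeholders.
format_metrics() {    # stdin: stats lines; $1: unix timestamp
    ts=$1
    while read -r _ mult count; do
        echo "iabak.${SHARD}.numcopies.${mult#+} ${count} ${ts}"
    done
}
SHARD=SHARD1
# All metrics in a single TCP connection, as suggested above:
# grep 'numcopies +' "/var/www/html/stats/$SHARD" |
#     format_metrics "$(date +%s)" | nc "$GRAPHITE_HOST" 2003
printf 'numcopies +3 42\n' | format_metrics 1428000000
```

Piping the whole loop into one `nc` invocation addresses the "feed it all of the stats in a single TCP connection" point at 05:47.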
to server: http://git.io/veIlU [06:06] [IA.BAK] joeyh pushed 4 new commits to master: http://git.io/veIlT [06:07] so, I've set up SHARD2. If someone wants to jump the gun, see the README for how to add another shard to your system [06:07] SHARD2 is 3.2 tb IIRC [06:07] it contains: NISTJournalofResearch, 1880_census, speedydeletionwiki [06:11] IA.BAK/server a5325ee Joey Hess: new script [06:11] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veI8p [06:14] I'd like to do some tests with SHARD1 when it's synced [06:14] yes, absolutely [06:15] quite a few things to work out.. need to test a restore of some files. [06:15] and, need to get the fscking working and make sure it notices when a client drops out, and deals with that [06:17] I just added a repolist file to the repo, that lists the shards. I'm thinking they can be in one of a few states [06:17] active: how SHARD1 is now, being filled in [06:17] pending: like SHARD2, not recommended to use yet [06:18] maintenance: just sitting there and being checked periodically [06:18] if a client falls out, a shard may change from maintenance back to active, so it'll get more clients to fix it back up [06:20] oh, and also restore: clients should try to upload files from this shard that are marked as not being present in the IA any longer [06:22] I think one thing is I'd like a random client to delete 50-60gb of material [06:24] closure: ./iabak-helper: 54: ./iabak-helper: numfmt: not found [06:25] and/or a client to quietly vanish and never be heard from again [06:25] (for all we know, this has already happened) [06:27] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veIE5 [06:27] IA.BAK/master 1f8be49 Joey Hess: if numfmt is not available (old coreutils), skip the diskreserve sanity check [06:46] mappy [06:58] http://iabackup.archiveteam.org/ia.bak/ now has US [06:58] It loads in a little weird and slow for me, but I'll leave it for now. 
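The numfmt fallback committed above (1f8be49: "if numfmt is not available (old coreutils), skip the diskreserve sanity check") follows a pattern like this; `to_bytes` is a hypothetical helper name, not the script's actual code.

```shell
# Sketch of the fallback pattern from commit 1f8be49: old coreutils
# and busybox lack numfmt, so degrade gracefully rather than crash.
# to_bytes is a hypothetical helper name.
to_bytes() {
    if command -v numfmt >/dev/null 2>&1; then
        numfmt --from=iec "$1"     # e.g. "2G" -> 2147483648
    else
        echo "numfmt not found; skipping diskreserve sanity check" >&2
        echo 0                     # 0 disables the check
    fi
}
to_bytes 2G
```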
[07:07] closure: syncing now needs a password; was that intentional? [07:16] that US map is freaky accurate [07:16] SketchCow: is that all geolocation data? [07:17] yes, it's built off the geoip utility [07:17] "freaky accurate" meaning "I can see myself in that map" [07:17] ooh lithuania get [07:17] https://freegeoip.net [07:17] So, you can go and do: [07:17] freegeoip.net/csv/8.8.8.8 [07:17] freegeoip.net/json/github.com [07:18] And it'll do things in that format. [07:18] As you can see, you can use hostnames or IPs. [07:18] You get a limit of 10,000 free requests a day. [07:18] now I'm wondering who the others in Chicagoland are [07:19] They're not. [07:19] I knew it, knew I'd have to go ahead and get that right. [07:19] Ach, coding [07:20] I'm using cities [07:20] Just cities, no states. Goog tries to do its best. [07:20] I have to go in and make that different now. [07:20] ahh [07:20] For example, it's Deerfield, NH. Not Deerfield, IL [07:25] dang and I was getting excited that there were others around here [07:29] :) [07:29] db48x: shouldn't need password. Seems to work ok from here. [07:29] OK, so. [07:29] I redid it. [07:29] It uses zip code instead of city. [07:30] You can see now there ARE two clients in chicagoland [07:32] closure: hrm, it's only broken on one machine... [07:33] *** yipdw has changed the topic to: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | Stats: http://iabackup.archiveteam.org/ia.bak/ | #archiveteam [07:33] db48x: must be a ssh key issue [07:37] repo has the key in it... [07:38] IA.BAK/master 28755e9 Joey Hess: disable auto gc by default... [07:38] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veI1K [07:40] closure: does shard2 require a different ssh key adding location? [07:40] or will it use the same user as shard1? [07:40] (I suppose I could read the scripts lol) [07:41] different directory on the pubkeys branch [07:42] Added information about the project at the bottom of the graph page. 
[07:44] SketchCow: s/which/wish/ [07:44] it was always perfect and that never happened [07:45] :) [07:48] *** bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak [07:49] closure: I upped my side with the graph to 10 minutes. [07:49] Simply because we're starting to see real progress [07:50] *** bzc6p has quit (Read error: Operation timed out) [07:50] Also: If in the future we have a country with a ton of clients (I guess Germany is getting there) it is not difficult to do additional ones. [08:48] closure: it's untested, but if you want to take a look, https://github.com/db48x/IA.BAK/commit/c8ea0e2d8c70bda9a117d03d933c179cf3dd265c has code for sending the stats to graphite [09:02] Very minor addition: My graph now looks at numbers higher than 3 and just smooshes them in [09:02] Apparently some files/items are in six places now, meaning a total of 8! [09:07] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak [09:59] *** zottelbe- (~zottelbey@[redacted]) has joined #internetarchive.bak [10:00] *** zottelbey has quit (Ping timeout: 512 seconds) [10:11] *** zottelbe- is now known as zottelbey [10:53] *** db48x has quit (Ping timeout: 258 seconds) [12:45] underscor: ./iabak-helper: numfmt: not found what's your OS? [13:39] *** closure looks at beautiful docker-style multi progress bars [13:39] oh yesss [13:40] Working 85% [========================================= ] 17/ 20 (for 1.7, 0.3 remaining) [13:40] Working 20% [========== ] 4/ 20 (for 1.6, 6.4 remaining) [13:40] Working 40% [==================== ] 8/ 20 (for 1.6, 2.4 remaining) [13:40] guy who wrote the library explicitly wants to support git-annex [13:41] nice [13:42] when is the patch ready? :D [13:42] want to try it myself [13:42] well, I still have to write all the code to use the library, and he has to fix at least one of the 5 bugs I've filed on it this morning [13:50] looking forward to it! [14:20] *** bzc6p_ is now known as bzc6p [14:25] what's the total size of shard1? 
[14:27] http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation - 2.91TB [14:33] ah, for some reason I thought it was 1.9TB [14:33] still got another TB to go [14:36] im at a measly .06TB -.- [14:36] trs80: how long did it take to download 1.9TB? [14:37] closure: does it even make sense to implement a multi-support into git-annex and not into the wrapper-script? [14:38] it's been about 36 hours with 10 instances of iabak [14:38] at about 150Mb/s [14:40] *** trs80 fires up another 10 [15:14] http://iabackup.archiveteam.org/ia.bak [15:14] So, I'm noticing that +0 is at 22. I'm intrigued that it's still sticking around, even after hours. [15:14] Maybe they're really, really huge? Or something else? [15:15] there are some 403s [15:15] OK. So 22 ones that turn out to be private. [15:16] not clear what's going on, I just ran git annex get --not --copies 4 again, and it managed to download 1 more file [15:16] that had failed earlier [15:41] tonight I'm going to move this drive to a different computer [15:41] so if you want to see what it looks like when a node disappears... [15:46] And we do [16:41] going from 10->20 iabaks has hit diminishing returns, was getting 150Mbps and now 200Mbps [16:45] oh well. i get 1.2-2MB/s with 8. [16:47] look at that red evaporate [16:48] Will records of this copy just expire after failing to sync for a while? or do i have to remove the data and then sync again? [16:52] closure: I told you graphs were important! [16:52] It really helps get a grip on what's there. [17:02] sep332: the distributed fsck stuff that will let us notice when a copy drops out is implemented in git-annex, but the scripts need to be made to run it periodically [17:07] *** sankin1 (~sankin@[redacted]) has joined #internetarchive.bak [17:12] *** londoncal (~londoncal@[redacted]) has joined #internetarchive.bak [17:13] So when another node runs fsck --distributed, it will notice that my node hasn't checked in in a while? 
[17:17] we run fsck --expire on the server, and it notices [17:17] the nodes run fsck --distributed to avoid being expired [17:19] ok [17:22] closure - is --expire run automatically? [17:23] (or is this a feature you're adding?) [17:23] we can run --expire in a cron job [17:23] on the server, so easy [17:23] running fsck --distributed on the clients periodically is harder [17:24] means the clients have to run iabak from time to time even once they're full [17:24] and users have to remember, or it be automated somehow, etc [17:24] I am assuming that. [17:24] Like, I'm assuming the client will do that. [17:24] My estimate is once every two weeks. [17:25] And it should be automated, and our server should mail the contact to go "we haven't heard from you in X days" [17:25] something on the order of a month, 2 weeks might work. It will increase the git repo size over time some [17:25] By how much [17:25] A few k, right [17:25] one of the things I am going to check with shard1 [17:25] dunno exactly [17:26] so the email contact.. I was thinking about that too, and it's worth it, but we need registration then [17:27] Agreed. [17:28] Email's especially going to be worth it if anyone's using mostly offline disks for this. [17:28] Well, without communication, or a way to mail out updates, it won't scale anyway. [17:28] I mean, we can do 30 people relatively well. [17:28] e.g. have a spare external drive that they're shoving bits on rather than spare space on a system they use. [17:28] But we haven't figured out how many people this is going to be. Possibly over 5000. [17:28] pikhq: I agree, and we can also give people we trust longer offline between checkins before they get expired [17:29] SketchCow: 25000, back of the napkin estimate [17:30] we have 1 shard almost done with around 15 people [17:31] 25,000 assumes people don't take up roles.
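The server-side cron job discussed above might look like the fragment below. The schedule, user, repo path, and expiry window are all placeholders; the flags are only those named in the conversation, and (per the 20:38 exchange later in the log) they only shipped in a later git-annex release.

```shell
# /etc/cron.d sketch -- every value here is a placeholder.
# Server side: expire clients that have not checked in recently.
0 3 * * *   SHARD1   cd /home/SHARD1/shard1 && git annex fsck --expire 1m
# Client side (run periodically via iabak, even once the disk is full):
#   git annex fsck --distributed
```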
[17:31] You mean 25,000 "Shard Roles" [17:32] We don't want the same person taking two copies of the same shard, but we want people taking 5-50 shards where possible. [17:33] Tell me when I'm wrong. [17:33] although.. this shard is probably smaller than average disk size. average shard is probably 11 tb. [17:34] SketchCow: if people take 5-50 shards (which does make sense), they are probably not storing a full tb of each shard, which some of the people are for SHARD1. [17:35] So, two obvious thoughts I will probably restate multiple times [17:36] 1. The amount this project backs up will always be a fraction of the Archive's stores, simply because of the realistic situation of not sharing non-public items and the fact that some of the contents are warmed over horseshit [17:36] 20 petabytes is 42000 x 500 gb, so 25000 people seems ballpark [17:36] 2. The project will "succeed" when it gains enough traction that various companies donate disk server space to the project, real space. Real REAL space. [17:37] Whether we stay down in Ham Radio Club-level adoption or SETI-level is unclear. [17:37] What this IS doing is making git-annex better and it's making IA ask very important questions it needed to ask 10 years ago. [18:09] *** wp494_ (~wickedpla@[redacted]) has joined #internetarchive.bak [18:12] *** wp494 has quit (Ping timeout: 740 seconds) [18:13] *** wp494_ is now known as wp494 [18:15] *** beardicus (~beardicus@[redacted]) has joined #internetarchive.bak [18:17] *** wp494 has quit (LOUD UNNECESSARY QUIT MESSAGES) [18:20] ahoy! [18:21] who shall recieve my glorious public key? [18:21] *** wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak [18:23] me [18:23] bestow it upon me [18:23] get ready!!!!! 
[18:23] ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDCGLT7D7/XP2ZhTF0sEBU1L1cOWXWeEAGNa/P4OvjZfaaA4Sew6V9jTvK8fL3bX [18:23] H/z9dcSHZ4mpLj6+7RvSo2F8jpxc9ggO/dVu6bxtnicsP6Yha6yeFwu6V7n5zDOQ57YSRzo8OR3tfzumH6Gg08nzdmSkmjMrOFk34 [18:23] K32sXINKkmOL9ekJneBkIx7yQ1buFXXMQl57JsKD3QE1kGWM1oWidMsNF8Q3WJ1mmE6yS3Pa489WW4B8frgfYU4UeA0UmlPBCH0cR [18:23] f3M9jKkN4ET8Q5zfeI42J6ZvO6cr2INKwDKUA1GFOk8zDeWE4JAMv9GNVqShKbv3KsbqMZEo5bclX bert@laslo [18:23] doh. [18:24] can pastebin that if you prefer. newb mistake. [18:24] one line plz. /msg is ok [18:27] IA.BAK/pubkey 0cb8d91 Joey Hess: I am beardicus! [18:27] [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/vemK3 [18:27] yer in [18:27] thanky hanky. [18:29] no dice closure. patience required or mis-paste? [18:31] seems I need to fix something on the server [18:31] key didn't deploy [18:31] try it now [18:32] success. [18:32] or wait... [18:32] hmm, all I did was hit the webhook. Perhaps github is being slow.. [18:32] it's asking me for SHARD1's password. [18:32] so not so much. [18:34] closure: http://pastebin.com/Mh7yjGKV [18:35] so, it shouldn't be prompting for a password, ssh is being run with -o BatchMode=yes [18:36] and, it managed to connect to the server once with the key [18:36] interesting. Maybe we need to enable batchmode all the time, not just in the initial probe [18:37] muxserver_listen: link mux listener .git/annex/ssh/SHARD1@iabak.archiveteam.org.QK8zOCbtNebI7q54 => .git/annex/ssh/SHARD1@iabak.archiveteam.org: Operation not permitted [18:37] I have never seen ssh say that before [18:38] are you perhaps using this on a filesystem that does not support unix sockets? [18:38] heh. it's on an nfs share. :( [18:38] try this: cd shard1; git config annex.sshcaching false [18:38] and then re-run ./iabak [18:39] that got rid of the muxserver error, but the password prompt remains. 
[18:40] beardicus: check the permissions of id_rsa in the .git/annex/ directory [18:40] yeah, that sounds wise [18:40] i think it's possible that due to being an nfs share, permissions screw up, and ssh will (for some reason silently) refuse to use the key [18:41] maybe try: ssh -i shard1/.git/annex/id_rsa -v SHARD1@iabak.archiveteam.org [18:41] and see why ssh is refusing to use that id_rsa [18:42] -o IdentitiesOnly=yes might also be useful if this is a recent Ubuntu system and you have many SSH keys [18:42] id_rsa is rw user only. it is user "nobody" though. so that's probably it. [18:43] beardicus: also, I'm curious about the fifo problem. git-annex probes to see if the filesystem supports them. but its probe seems to have missed this case [18:44] I wonder if you could try, on the nfs share: mkfifo foo, and then if that succeeds, see if you can ln foo bar [18:44] beardicus: are you running this as nobody? [18:44] i am not. [18:45] so, nfs ate owner?! [18:45] *** closure remembers there was a reason he stopped using nfs in 1996 [18:45] yeah. nfs is a black hole for me. can't think of a better way to permanently mount my nas on my linux box. [18:46] well, you can run git-annex and iabak on some nas's :) [18:46] but I would like to investigate the nfs problems a bit more [18:46] yeah. thought about it. i think it's x86-based. [18:47] it creates IA.BAK/id_rsa and then copies it to IA.BAK/shard1/.git/annex/id_rsa .. is only the copy owned by the wrong user [18:47] ? [18:49] everything on the mount ends up as nobody/nogroup [18:49] could not hard link that fifo closure [18:50] hmm, it was able to use the id_rsa once [18:50] so you were able to create the fifo? [18:50] yes. [18:50] cool.. I guessed right about the problem with the fifo! [18:50] will fix that [18:53] guess i need to figure out all these nfs permissions baloney before i proceed. Going to try running it on the nas.
[18:54] actually, you can hack around it [18:54] just do this: copy id_rsa from the nfs to your home directory someplace [18:54] and then, edit IA.BAK/shard1/.git/config [18:55] there's a line that has the path to the id_rsa file, just update that [18:55] we could even automate this [18:55] check if the file is owned by whoami and if not, put it in $HOME/ [18:58] no dice. [18:58] copied id_rsa to ~/.ssh/at_id_rsa [18:58] permissions are rw only for user, owner is current user [18:59] edited shard1/.git/config [18:59] "annex-ssh-options = -i ~/.ssh/at_id_rsa" [18:59] not sure if ~ works in there [19:00] tried it with a full path. same password prompt. [19:02] is group still nobody? [19:02] nope. [19:06] poking around with the on the nas... busybox `find` command chokes on the -printf option. [19:06] this is used in sharddirs="$(find . -maxdepth 1 -name shard\* -type d -printf "%P\n")" [19:07] I think you'll have better luck debugging ssh [19:08] *** db48x (~user@[redacted]) has joined #internetarchive.bak [19:08] *** svchfoo2 gives channel operator status to db48x [19:10] Shift-reload on the graph page shows the blocks getting smaller and the 4 block getting bigger. [19:13] *** db48x wonders why his computer was off [19:15] beardicus: does .git/config have references to crippled mode? [19:16] Kazzy no. [19:21] *** londoncal has quit (Quit: Leaving...) [19:28] *** matthusb- (~matthusby@[redacted]) has joined #internetarchive.bak [19:33] working ok on a synology nas after working around that printf issue. [19:33] *** Nemo_bis (~Nemo_bis@[redacted]) has joined #internetarchive.bak [19:33] though i still had to fix permissions on the id_rsa file. [19:36] *** kniffy (~kniffy@[redacted]) has joined #internetarchive.bak [19:36] *** Nemo_bis (~Nemo_bis@[redacted]) has left #internetarchive.bak [19:48] *** SN4T14__ (~SN4T14@[redacted]) has joined #internetarchive.bak [19:56] *** SN4T14_ has quit (Ping timeout: 512 seconds) [20:13] beardicus: awesome! 
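The "check if the file is owned by whoami and if not, put it in $HOME" automation floated above could be sketched like this; `fix_key_path` is a hypothetical helper name, the local key filename is a placeholder, and GNU stat is assumed.

```shell
# Hedged sketch automating the workaround discussed: if NFS mangled the
# key's owner (e.g. to "nobody"), copy it somewhere local and use that
# path instead. fix_key_path is a hypothetical name; GNU stat assumed.
fix_key_path() {
    key=$1
    if [ "$(stat -c %U "$key")" = "$(id -un)" ]; then
        echo "$key"                               # ownership is fine
    else
        install -m 600 "$key" "$HOME/.iabak_id_rsa"
        echo "$HOME/.iabak_id_rsa"
    fi
}
# Usage sketch (paths as in the conversation above):
#   keypath=$(fix_key_path shard1/.git/annex/id_rsa)
#   git -C shard1 config remote.origin.annex-ssh-options \
#       "-i $keypath -o IdentitiesOnly=yes"
```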
[20:13] beardicus: hey, you could open a pull request with the printf workaround [20:13] not sure it went all the way. [20:13] is shard1 only about 600mb? [20:13] no, it's much bigger [20:14] yeah. i got a message that said it's fully backed up, and then exit to shell. [20:14] I think we're not done with SHARD1, so probably some problem there [20:15] yeah. must've just downloaded the index. [20:17] if i run iabak again, i get a bunch of 403s. [20:18] well, the 403's are known [20:18] ok. [20:18] but, your system seems to not be noticing the files that remain to get [20:18] i'll leave it going. [20:19] i'll see if it starts downloading some stuff. [20:19] do you have a /usr/bin/shuf? [20:19] i don't, not on the nas. [20:19] i wonder if busybox can do that and they just didn't link it. [20:20] you don't need it, I was just wondering which branch it was running [20:20] no shuf in busybox i guess. [20:21] it sounds like it tried to download only the files that no other node has downloaded yet [20:21] where instead, it's supposed to download files that are not on at least 3 other nodes [20:21] try this: cd shard1 ; git annex find --not --copies 4 [20:21] well it seemed to think all files were on four nodes. [20:21] see if it finds any files [20:22] seems so, but I don't think we're there yet are we? [20:23] git annex find --copies 2 --not --copies 3 | wc -l gives me 40383 [20:23] this could be it "git: 'annex' is not a git command. See 'git --help'." [20:24] beardicus: oh, I forgot to say, run IA.BAK/git-annex.linux/runshell first [20:26] closure that command is finding some files. [20:26] hmm, weird [20:26] try git annex get --not --copies 4 [20:26] that's what the script is supposed to run, in your case (with no shuf) [20:27] yeah. that looks like the output i get when i rerun iabak now. [20:27] so i should probably just let it run through all the 403s until it gets to the undone stuff. [20:28] doesn't really explain why it gave up in the first place though. 
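One possible shape for the printf workaround mentioned above: busybox find rejects `-printf "%P\n"` (quoted at 19:06), but the same directory list can be produced portably by stripping the leading `./` instead. `list_shards` is a hypothetical wrapper name.

```shell
# Portable replacement sketch for the -printf '%P\n' call that busybox
# find chokes on: pipe through sed to strip the "./" prefix instead.
list_shards() {
    find . -maxdepth 1 -name 'shard*' -type d | sed 's|^\./||'
}
# Original (GNU-only):
#   sharddirs="$(find . -maxdepth 1 -name shard\* -type d -printf "%P\n")"
# Portable:
#   sharddirs="$(list_shards)"
```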
[20:28] oh, I thought you said it got through them all and said it was done [20:28] *** bzc6p has quit (Read error: Operation timed out) [20:28] no, it downloaded what i assume is an index of some sort, and then claimed it was all done and exited back to shell. [20:29] hmm [20:30] what kind of claim it was done? [20:30] *** bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak [20:30] this is where real output pastes are useful [20:30] "Wow! This shard of the IA is fully backed up now!" [20:30] http://pastebin.com/dU9FQrrv [20:32] yeah, looks like the git annex get step somehow didn't run at all [20:32] I don't understand how [20:32] if it's plowing through 403's now, that step is certainly running this time [20:33] heh, after a sync I see only 8k left [20:34] not sure. gotta run to make dindin now. i'll try to do that pull request this evening closure. [20:35] there's a lot of better homes and gardens and ladies home journal in there that is all 403 [20:37] usfederalcourts, man [20:37] Should have gone with that [20:38] $git-annex fsck --distributed [20:38] git-annex: unrecognized option `--distributed' [20:38] sep332: it's in the daily builds, next release [20:39] ok. would it be useful for me to get it and do that before moving this drive? [20:39] no, you don't need to run it [20:39] I mean, you can run it wherever the drive ends up, and we'll know the drive still has the files [20:40] ok thanks [20:40] or you can not run it, and we'll realize you are not keeping the files [20:40] which is one of the scenarios we need to test [20:40] yeah i meant for testing [20:54] *** sankin1 has quit (Leaving.)
[21:01] uh oh, getting really long wait times on disk operations [22:14] oh boy, I can make this distributed fsck a lot more efficient [22:14] I think we could run fscks daily, or hourly if we wanted to, if I pull this off [22:18] I think it will be around 128 bytes per fsck per client [22:26] closure: just read about your changes to git annex get [22:26] as I understand it, "git annex get dir1 dir2" needs to download dir1/file1, dir1/file2, dir1/file3, dir2/subdir1/file1, dir2/subdir1/file2, dir2/subdir2/file1" etc, and you want them to be in the same order as the directories were specified on the command line? [22:27] yep [22:27] SketchCow: I broke the graph. Bonus: No numcopies +0 left! [22:27] why even generate the list of files ahead of time? why not produce a lazy sequence of them instead? [22:28] you can then produce them in sorted order, and amortize the cost over the whole time you're downloading things [22:28] git ls-files b c a will output a b c [22:30] I'd love to stream the list lazily, but I cannot when it's not generated in the right order [22:32] hmm [22:34] still seems like you'd be able to call ls-files on each dir individually, whenever the list runs out [22:36] closure: Did you remove the numcopies 0 stuff, or did it just happen to finally get all retrieved? [22:58] *** kyan (~kyan@[redacted]) has joined #internetarchive.bak [23:28] pikhq: git log says he removed the unavailable ones [23:32] *** kniffy (~kniffy@[redacted]) has left #internetarchive.bak (Leaving)
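The ordering problem discussed above ("git ls-files b c a will output a b c") has a simple per-directory workaround that also permits the lazy streaming db48x suggests: query each directory separately, in command-line order, consuming one listing before starting the next. `ordered_ls` is a hypothetical name, not git-annex's implementation.

```shell
# Sketch: git ls-files sorts its combined output, but calling it once
# per directory preserves the order the dirs were given on the command
# line, and each listing can be consumed lazily before the next starts.
ordered_ls() {
    for d in "$@"; do
        git ls-files -- "$d"
    done
}
```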