#internetarchive.bak 2015-04-03,Fri

Time Nickname Message
00:13 🔗 db48x closure: very cool :)
01:01 🔗 patricko- is now known as patrickod
01:28 🔗 Kazzy git-annex seems to ignore annex-ssh-options when running git annex find?
01:29 🔗 Kazzy (I have no idea how git-annex, works.. i assume that tries to connect to isbak.archiveteam.org when it runs.)
01:29 🔗 db48x git annex find only needs to look at the local metadata
01:34 🔗 Kazzy oops, debug messages i threw in threw me.. git annex sync doesn't play nicely with ssh-options
01:34 🔗 Kazzy https://github.com/db48x/IA.BAK/blob/a320bbbf0abd1359c0b20fbe7f412864437fa357/iabak-helper#L21
01:36 🔗 closure Kazzy: well, git annex sync didn't used to support the ssh-options, but that was improved in february and afaik it does now
01:37 🔗 Kazzy hm, this was downloaded today from https://downloads.kitenet.net/git-annex/linux... etc
01:37 🔗 Kazzy is there a newer version out there somewhere?
01:37 🔗 patrickod is now known as patricko-
01:37 🔗 closure if it wasn't working, we'd not be getting stats updates from all the clients
01:37 🔗 trs80 Kazzy: are you running git-annex in your regular PATH, or the one that was downloaded?
01:38 🔗 patricko- is now known as patrickod
01:38 🔗 Kazzy from the downloaded copy
01:42 🔗 trs80 closure: flock: invalid option -- 'E'
01:42 🔗 trs80 flock (util-linux 2.20.1)
01:42 🔗 closure oh, ok
01:42 🔗 trs80 although that's the wheezy version, let me go to jessie
01:42 🔗 closure that makes it not exit 1 when the lock is busy
01:44 🔗 trs80 yeah, flock from util-linux 2.25.2 has -E
01:44 🔗 GitHub103/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/vek2l
01:44 🔗 GitHub103/#internetarchive.bak IA.BAK/server 926dc43 Joey Hess: comment typo
01:51 🔗 closure trs80: should be fixed
01:56 🔗 Kazzy closure: what version of git-annex are you currently using?
01:57 🔗 closure 5.20150219 or newer will work
02:02 🔗 Kazzy totally weird behaviour, really.. can auth fine using the key itself, but when the script calls it, we don't play ball
02:04 🔗 closure if you have a ~/.config/git-annex/program that points at some other, old version you have installed, that could possibly explain it
02:05 🔗 closure hmm, no, I made a change recently that prevents that being a problem
02:06 🔗 Kazzy nope, ~/.config/git-annex/ doesn't exist at all, i changed the script to use the copy saved in a directory i downloaded it to earlier, from the link in the script
02:06 🔗 Kazzy git-annex version: 5.20150327-g0ae1f8c
02:07 🔗 closure my guess comes down to "I changed the script" ...
02:08 🔗 Kazzy running 'git annex sync' in the shard1 directory asks for the password, even outside the script
02:09 🔗 Kazzy the only real difference is the location that remote.origin.annex-ssh-options points to for the key, it's an absolute path
02:22 🔗 Disconnected (Connection reset by peer).
02:23 🔗 Now talking on #internetarchive.bak
02:23 🔗 Topic for #internetarchive.bak is: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | #archiveteam
02:23 🔗 Topic for #internetarchive.bak set by chfoo!~chris@[redacted] at Wed Mar 4 18:38:46 2015
02:25 🔗 chfoo has quit (Read error: Operation timed out)
02:27 🔗 You are now known as chfoo
03:00 🔗 underscor closure: does the "IA Only" include dark items?
03:01 🔗 underscor (ie, we'll never get rid of all of the red on that graph?)
03:01 🔗 db48x no, dark items were excluded from the census, and therefore weren't added to the shard
03:02 🔗 underscor There have since been a lot though
03:02 🔗 underscor I get a lot of 403s, at least
03:03 🔗 underscor http://archive.org/history/housegarden140julnewy
03:03 🔗 underscor for example
03:03 🔗 underscor isn't dark, but is in printdisabled collection
03:03 🔗 underscor which makes it "private" but not "dark"
03:05 🔗 db48x hmm, 12 days
03:05 🔗 db48x I think that change is newer than the census
03:06 🔗 db48x yes, the census was taken 3-4 weeks ago: https://archive.org/details/ia-bak-census_20150304
03:08 🔗 db48x so I guess the answer is that there will eventually be some small number of IA Only items left
03:13 🔗 closure we can remove them from the repo if we want to
03:18 🔗 GitHub24/#internetarchive.bak IA.BAK/server 94105b3 Joey Hess: run gc as SHARD user, not root, which was messing up perms and stats
03:18 🔗 GitHub24/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/vekFJ
03:22 🔗 ENDING LOGGING AT Thu Apr 2 22:22:26 2015
03:23 🔗 BEGIN LOGGING AT Thu Apr 2 22:23:08 2015
03:37 🔗 GitHub19/#internetarchive.bak IA.BAK/server 3bc8668 Joey Hess: cache geoip lookups
03:37 🔗 GitHub19/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/vekAO
03:39 🔗 closure only 1193 files remain that have not been downloaded at least once
03:39 🔗 closure those may all be dark at this point?
03:45 🔗 closure underscor: I didn't know about printdisabled collection..
03:47 🔗 closure I guess it's unlikely items in it will change.. tending toward deleting those from the repo
03:52 🔗 tpw_rules will it finish the git-annex if it can't get some because they have been darked?
03:53 🔗 closure it will finish, just will say some files failed to download
03:53 🔗 closure usenet-alt/alt.sex.pictures.mbox.zip
03:53 🔗 closure hah that's one file we've not gotten yet
03:53 🔗 closure that is not dark
03:53 🔗 closure watches for that file suddenly jump to 10 copies
03:58 🔗 tpw_rules i wonder what my disk bottleneck is like running 16 processes at once
04:11 🔗 pikhq Hrm, y'know, I've been idling here long enough.
04:12 🔗 pikhq I should actually go ahead and get a clone of the annex going. :)
04:13 🔗 pikhq would like to request being on the permitted hosts to access the repo
04:13 🔗 closure pikhq: I'll add you, paste me the key
04:14 🔗 pikhq Thanks.
04:15 🔗 GitHub103/#internetarchive.bak IA.BAK/pubkey 6918741 Joey Hess: add pikhq
04:15 🔗 GitHub103/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/veIJa
04:16 🔗 db48x that red patch is getting delightfully small
04:21 🔗 pikhq Hooray, "downloading from the Internet Archive".
04:38 🔗 GitHub100/#internetarchive.bak IA.BAK/server 546181a Joey Hess: typo
04:38 🔗 GitHub100/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIkC
04:43 🔗 GitHub130/#internetarchive.bak IA.BAK/server 74438bc Joey Hess: fix su call params
04:43 🔗 GitHub130/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIIT
04:54 🔗 GitHub138/#internetarchive.bak IA.BAK/server 0a5354e Joey Hess: support for more shards
04:54 🔗 GitHub138/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veILH
05:01 🔗 db48x I've been thinking about the stats page
05:01 🔗 db48x I think we need time-series data, to show the rate of progress
05:03 🔗 db48x we could use Graphite, and feed the existing stats into it whenever they're updated
05:13 🔗 db48x a simpler improvement would be to add the item count to the tooltips in the treemap:
05:13 🔗 db48x generateTooltip: function (row) { return "Of all the items in this shard, "+ data.getValue(row, 2) +" have "+ data.getValue(row, 3) +" known copies."; }
05:20 🔗 closure db48x: I have some data files that I think you could use
05:20 🔗 closure http://iabak.archiveteam.org/stats/SHARD1.filestransferred
05:20 🔗 closure http://iabak.archiveteam.org/stats/SHARD1.clientconnsperhour
05:21 🔗 closure please come up with a visualization
05:30 🔗 db48x I can't do a visualization, but I can do a chart
05:37 🔗 SketchCow I think he wants me to do it.
05:37 🔗 SketchCow Well, so, my stats page stuff is good for one-offs
05:38 🔗 SketchCow Also, I wanted someone else to do visualization, but nobody was stepping forward, so I started it.
05:39 🔗 SketchCow Also, looks like we're just on the edge of having zero IA-only items in the shard.
05:40 🔗 GitHub181/#internetarchive.bak IA.BAK/pubkey 92f416f Joey Hess: copy SHARD1 pubkeys to SHARD2
05:40 🔗 GitHub181/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/veIGg
05:41 🔗 db48x SketchCow: that generateTooltip line above could be added to the options that you pass to tree.draw
05:46 🔗 db48x closure: assuming we had a graphite server, something like this would push the current shardstats into it:
05:46 🔗 db48x while read mult count; do
05:46 🔗 db48x   echo "something.something.numcopies.${mult}.${SHARD} ${count} ${DATE}" | nc ${GRAPHITE_HOST} ${GRAPHITE_PORT}
05:46 🔗 db48x done < <(grep 'numcopies +' "/var/www/html/stats/$SHARD")
05:46 🔗 pikhq Apparently I picked a halfway-decent time to start grabbing stuff: when there was a point in fetching new things. :)
05:47 🔗 db48x should be modified slightly to feed it all of the stats in a single TCP connection
05:47 🔗 db48x and we've got to name the stats fairly well, and pick storage policies for them, and so on
05:48 🔗 db48x but then Graphite puts them into a time-series database (Whisper), and makes querying that via HTTP really easy
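The idea of feeding all of the stats to Graphite over a single TCP connection can be sketched in Python. This is only an illustration, not the code from the IA.BAK repo: the `iabak` metric prefix, the counts, and the host name are made up, and it assumes a carbon listener speaking Graphite's plaintext protocol (`<path> <value> <timestamp>` per line, port 2003 by default).

```python
import socket
import time


def format_metrics(shard, numcopies, timestamp, prefix="iabak"):
    """Build one Graphite plaintext line per numcopies bucket.

    `prefix` is a hypothetical metric namespace; pick real names before use.
    """
    lines = []
    for mult, count in sorted(numcopies.items()):
        path = f"{prefix}.numcopies.{mult}.{shard}"
        lines.append(f"{path} {count} {timestamp}\n")
    return "".join(lines)


def send_metrics(payload, host, port=2003, timeout=10):
    """Send every line over one TCP connection instead of one nc call per line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload.encode("ascii"))


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    payload = format_metrics("SHARD1", {0: 22, 1: 310}, int(time.time()))
    print(payload, end="")
    # send_metrics(payload, "graphite.example.org")  # hypothetical host
```

Building the whole payload first keeps the network part to a single `sendall`, which is what "a single TCP connection" amounts to here.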
05:49 🔗 GitHub111/#internetarchive.bak IA.BAK/server 4e1486f Joey Hess: typo
05:49 🔗 GitHub111/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veInB
05:50 🔗 GitHub173/#internetarchive.bak IA.BAK/server 3a1b03a Joey Hess: typo
05:50 🔗 GitHub173/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veInD
05:52 🔗 GitHub89/#internetarchive.bak IA.BAK/server e04331c Joey Hess: fix cd
05:52 🔗 GitHub89/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIcv
05:53 🔗 GitHub49/#internetarchive.bak IA.BAK/server d3bab05 Joey Hess: remove dirname complication, not needed
05:53 🔗 GitHub49/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIc8
05:54 🔗 GitHub156/#internetarchive.bak IA.BAK/server 41a6f38 Joey Hess: typo
05:54 🔗 GitHub156/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIcu
06:01 🔗 SketchCow We're going to break the zero-IA barrier
06:06 🔗 GitHub61/#internetarchive.bak IA.BAK/master 229385f Joey Hess: initial support for multiple shards
06:06 🔗 GitHub61/#internetarchive.bak IA.BAK/master e819d37 Joey Hess: typo
06:06 🔗 GitHub61/#internetarchive.bak IA.BAK/master e9adea4 Joey Hess: fix set -- lines
06:06 🔗 GitHub27/#internetarchive.bak IA.BAK/server 25d0680 Joey Hess: few fixed to shard user setup
06:06 🔗 GitHub27/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veIlU
06:06 🔗 GitHub61/#internetarchive.bak [IA.BAK] joeyh pushed 4 new commits to master: http://git.io/veIlT
06:07 🔗 closure so, I've set up SHARD2. If someone wants to jump the gun, see the README for how to add another shard to your system
06:07 🔗 closure SHARD2 is 3.2 tb IIRC
06:07 🔗 closure it contains: NISTJournalofResearch, 1880_census, speedydeletionwiki
06:11 🔗 GitHub159/#internetarchive.bak IA.BAK/server a5325ee Joey Hess: new script
06:11 🔗 GitHub159/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veI8p
06:14 🔗 SketchCow I'd like to do some tests with SHARD1 when it's synced
06:14 🔗 closure yes, absolutely
06:15 🔗 closure quite a few things to work out.. need to test a restore of some files.
06:15 🔗 closure and, need to get the fscking working and make sure it notices when a client drops out, and deals with that
06:17 🔗 closure I just added a repolist file to the repo, that lists the shards. I'm thinking they can be in one of a few states
06:17 🔗 closure active: how SHARD1 is now, being filled in
06:17 🔗 closure pending: like SHARD2, not recommended to use yet
06:18 🔗 closure maintenance: just sitting there and being checked periodically
06:18 🔗 closure if a client falls out, a shard may change from maintenance back to active, so it'll get more clients to fix it back up
06:20 🔗 closure oh, and also restore: clients should try to upload files from this shard that are marked as not being present in the IA any longer
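The shard states described above could be modelled roughly as below. This is a sketch under assumptions: the `after_check` helper and its `files_below_target` argument are hypothetical, and the active-to-maintenance step is an inference; only the maintenance-to-active fallback is stated in the discussion.

```python
from enum import Enum


class ShardState(Enum):
    ACTIVE = "active"            # being filled in, like SHARD1
    PENDING = "pending"          # set up but not recommended yet, like SHARD2
    MAINTENANCE = "maintenance"  # sitting there, checked periodically
    RESTORE = "restore"          # clients upload files no longer present in the IA


def after_check(state, files_below_target):
    """Return the state a shard should be in after a periodic check.

    files_below_target: number of files with fewer copies than wanted
    (hypothetical input; the real scripts may track this differently).
    """
    if state is ShardState.MAINTENANCE and files_below_target > 0:
        # A client fell out: reopen the shard so more clients fix it up.
        return ShardState.ACTIVE
    if state is ShardState.ACTIVE and files_below_target == 0:
        # Assumed transition: a full shard only needs periodic checking.
        return ShardState.MAINTENANCE
    return state
```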
06:22 🔗 SketchCow I think one thing is I'd like a random client to delete 50-60gb of material
06:24 🔗 underscor closure: ./iabak-helper: 54: ./iabak-helper: numfmt: not found
06:25 🔗 closure and/or a client to quietly vanish and never be heard from again
06:25 🔗 closure (for all we know, this has already happened)
06:27 🔗 GitHub195/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veIE5
06:27 🔗 GitHub195/#internetarchive.bak IA.BAK/master 1f8be49 Joey Hess: if numfmt is not available (old coreutils), skip the diskreserve sanity check
06:46 🔗 closure mappy
06:58 🔗 SketchCow http://iabackup.archiveteam.org/ia.bak/ now has US
06:58 🔗 SketchCow It loads in a little weird and slow for me, but I'll leave it for now.
07:07 🔗 db48x closure: syncing now needs a password; was that intentional?
07:16 🔗 yipdw that US map is freaky accurate
07:16 🔗 yipdw SketchCow: is that all geolocation data?
07:17 🔗 SketchCow yes, it's built off the geoip utility
07:17 🔗 yipdw "freaky accurate" meaning "I can see myself in that map"
07:17 🔗 DFJustin ooh lithuania get
07:17 🔗 SketchCow https://freegeoip.net
07:17 🔗 SketchCow So, you can go and do:
07:17 🔗 SketchCow freegeoip.net/csv/8.8.8.8
07:17 🔗 SketchCow freegeoip.net/json/github.com
07:18 🔗 SketchCow And it'll do things in that format.
07:18 🔗 SketchCow As you can see, you can use hostnames or IPs.
07:18 🔗 SketchCow You get a limit of 10,000 free requests a day.
07:18 🔗 yipdw now I'm wondering who the others in Chicagoland are
07:19 🔗 SketchCow They're not.
07:19 🔗 SketchCow I knew it, knew I'd have to go ahead and get that right.
07:19 🔗 SketchCow Ach, coding
07:20 🔗 SketchCow I'm using cities
07:20 🔗 SketchCow Just cities, no states. Goog tries to do its best.
07:20 🔗 SketchCow I have to go in and make that different now.
07:20 🔗 yipdw ahh
07:20 🔗 SketchCow For example, it's Deerfield, NH. Not Deerfield, IL
07:25 🔗 yipdw dang and I was getting excited that there were others around here
07:29 🔗 db48x :)
07:29 🔗 closure db48x: shouldn't need password. Seems to work ok from here.
07:29 🔗 SketchCow OK, so.
07:29 🔗 SketchCow I redid it.
07:29 🔗 SketchCow It uses zip code instead of city.
07:30 🔗 SketchCow You can see now there ARE two clients in chicagoland
07:32 🔗 db48x closure: hrm, it's only broken on one machine...
07:33 🔗 yipdw has changed the topic to: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | Stats: http://iabackup.archiveteam.org/ia.bak/ | #archiveteam
07:33 🔗 closure db48x: must be a ssh key issue
07:37 🔗 db48x repo has the key in it...
07:38 🔗 GitHub94/#internetarchive.bak IA.BAK/master 28755e9 Joey Hess: disable auto gc by default...
07:38 🔗 GitHub94/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veI1K
07:40 🔗 underscor closure: does shard2 require a different ssh key adding location?
07:40 🔗 underscor or will it use the same user as shard1?
07:40 🔗 underscor (I suppose I could read the scripts lol)
07:41 🔗 db48x different directory on the pubkeys branch
07:42 🔗 SketchCow Added information about the project at the bottom of the graph page.
07:44 🔗 db48x SketchCow: s/which/wish/
07:44 🔗 SketchCow it was always perfect and that never happened
07:45 🔗 db48x :)
07:48 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
07:49 🔗 SketchCow closure: I upped my side with the graph to 10 minutes.
07:49 🔗 SketchCow Simply because we're starting to see real progress
07:50 🔗 bzc6p has quit (Read error: Operation timed out)
07:50 🔗 SketchCow Also: If in the future we have a country with a ton of clients (I guess Germany is getting there) it is not difficult to do additional ones.
08:48 🔗 db48x closure: it's untested, but if you want to take a look, https://github.com/db48x/IA.BAK/commit/c8ea0e2d8c70bda9a117d03d933c179cf3dd265c has code for sending the stats to graphite
09:02 🔗 SketchCow Very minor addition: My graph now looks at numbers higher than 3 and just smooshes them in
09:02 🔗 SketchCow Apparently some files/items are in six places now, meaning a total of 8!
09:07 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
09:59 🔗 zottelbe- (~zottelbey@[redacted]) has joined #internetarchive.bak
10:00 🔗 zottelbey has quit (Ping timeout: 512 seconds)
10:11 🔗 zottelbe- is now known as zottelbey
10:53 🔗 db48x has quit (Ping timeout: 258 seconds)
12:45 🔗 hater underscor: ./iabak-helper: numfmt: not found what's your OS?
13:39 🔗 closure looks at beautiful docker-style multi progress bars
13:39 🔗 closure oh yesss
13:40 🔗 closure Working 85% [========================================= ] 17/ 20 (for 1.7, 0.3 remaining)
13:40 🔗 closure Working 20% [========== ] 4/ 20 (for 1.6, 6.4 remaining)
13:40 🔗 closure Working 40% [==================== ] 8/ 20 (for 1.6, 2.4 remaining)
13:40 🔗 closure guy who wrote the library explicitly wants to support git-annex
13:41 🔗 hater nice
13:42 🔗 hater when is the patch ready? :D
13:42 🔗 hater want to try it myself
13:42 🔗 closure well, I still have to write all the code to use the library, and he has to fix at least one of the 5 bugs I've filed on it this morning
13:50 🔗 zottelbey looking forward to it!
14:20 🔗 bzc6p_ is now known as bzc6p
14:25 🔗 trs80 what's the total size of shard1?
14:27 🔗 zottelbey http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation - 2.91TB
14:33 🔗 trs80 ah, for some reason I thought it was 1.9TB
14:33 🔗 trs80 still got another TB to go
14:36 🔗 zottelbey im at a measly .06TB -.-
14:36 🔗 hater trs80: how long did it take to download 1.9TB?
14:37 🔗 hater closure: does it even make sense to implement a multi-support into git-annex and not into the wrapper-script?
14:38 🔗 trs80 it's been about 36 hours with 10 instances of iabak
14:38 🔗 trs80 at about 150Mb/s
14:40 🔗 trs80 fires up another 10
15:14 🔗 SketchCow http://iabackup.archiveteam.org/ia.bak
15:14 🔗 SketchCow So, I'm noticing that +0 is at 22. I'm intrigued that it's still sticking around, even after hours.
15:14 🔗 SketchCow Maybe they're really, really huge? Or something else?
15:15 🔗 closure there are some 403s
15:15 🔗 SketchCow OK. So 22 ones that turn out to be private.
15:16 🔗 closure not clear what's going on, I just ran git annex get --not --copies 4 again, and it managed to download 1 more file
15:16 🔗 closure that had failed earlier
15:41 🔗 sep332 tonight I'm going to move this drive to a different computer
15:41 🔗 sep332 so if you want to see what it looks like when a node disappears...
15:46 🔗 SketchCow And we do
16:41 🔗 trs80 going from 10->20 iabaks has hit diminishing returns, was getting 150Mbps and now 200Mbps
16:45 🔗 zottelbey oh well. i get 1.2-2MB/s with 8.
16:47 🔗 closure look at that red evaporate
16:48 🔗 sep332 Will records of this copy just expire after failing to sync for a while? or do i have to remove the data and then sync again?
16:52 🔗 SketchCow closure: I told you graphs were important!
16:52 🔗 SketchCow It really helps get a grip on what's there.
17:02 🔗 closure sep332: the distributed fsck stuff that will let us notice when a copy drops out is implemented in git-annex, but the scripts need to be made to run it periodically
17:07 🔗 sankin1 (~sankin@[redacted]) has joined #internetarchive.bak
17:12 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
17:13 🔗 sep332 So when another node runs fsck --distributed, it will notice that my node hasn't checked in in a while?
17:17 🔗 closure we run fsck --expire on the server, and it notices
17:17 🔗 closure the nodes run fsck --distributed to avoid being expired
17:19 🔗 sep332 ok
17:22 🔗 SketchCow closure - is --expire run automatically?
17:23 🔗 SketchCow (or is this a feature you're adding?)
17:23 🔗 closure we can run --expire in a cron job
17:23 🔗 closure on the server, so easy
17:23 🔗 closure running fsck --distributed on the clients periodically is harder
17:24 🔗 closure means the clients have to run iabak from time to time even once they're full
17:24 🔗 closure and users have to remember, or it be automated somehow, etc
17:24 🔗 SketchCow I am assuming that.
17:24 🔗 SketchCow Like, I'm assuming the client will do that.
17:24 🔗 SketchCow My estimate is once every two weeks.
17:25 🔗 SketchCow And it should be automated, and our server should mail the contact to go "we haven't heard from you in X days"
17:25 🔗 closure something on the order of a month, 2 weeks might work. It will increase the git repo size over time some
17:25 🔗 SketchCow By how much
17:25 🔗 SketchCow A few k, right
17:25 🔗 closure one of the things I am going to check with shard1
17:25 🔗 closure dunno exactly
17:26 🔗 closure so the email contact.. I was thinking about that too, and it's worth it, but we need registration then
17:27 🔗 SketchCow Agreed.
17:28 🔗 pikhq Email's especially going to be worth it if anyone's using mostly offline disks for this.
17:28 🔗 SketchCow Well, without communication, or a way to mail out updates, it won't scale anyway.
17:28 🔗 SketchCow I mean, we can do 30 people relatively well.
17:28 🔗 pikhq e.g. have a spare external drive that they're shoving bits on rather than spare space on a system they use.
17:28 🔗 SketchCow But we haven't figured out how many people this is going to be. Possibly over 5000.
17:28 🔗 closure pikhq: I agree, and we can also give people we trust longer offline between checkins before they get expired
17:29 🔗 closure SketchCow: 25000, back of the napkin estimate
17:30 🔗 closure we have 1 shard almost done with around 15 people
17:31 🔗 SketchCow 25,000 assumes people don't take up roles.
17:31 🔗 SketchCow You mean 25,000 "Shard Roles"
17:32 🔗 SketchCow We don't want the same person taking two copies of the same shard, but we want people taking 5-50 shards where possible.
17:33 🔗 SketchCow Tell me when I'm wrong.
17:33 🔗 closure although.. this shard is probably smaller than average disk size. average shard is probably 11 tb.
17:34 🔗 closure SketchCow: if people take 5-50 shards (which does make sense), they are probably not storing a full tb of each shard, which some of the people are for SHARD1.
17:35 🔗 SketchCow So, two obvious thoughts I will probably restate multiple times
17:36 🔗 SketchCow 1. The amount this project backs up will always be a fraction of the Archive's stores, simply because of the realistic situation of not sharing non-public items and the fact that some of the contents are warmed over horseshit
17:36 🔗 closure 20 petabytes is 42000 x 500 gb, so 25000 people seems ballpark
17:36 🔗 SketchCow 2. The project will "succeed" when it gains enough traction that various companies donate disk server space to the project, real space. Real REAL space.
17:37 🔗 SketchCow Whether we stay down in Ham Radio Club-level adoption or SETI-level is unclear.
17:37 🔗 SketchCow What this IS doing is making git-annex better and it's making IA ask very important questions it needed to ask 10 years ago.
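The "20 petabytes is 42000 x 500 gb" estimate above can be checked with a few lines of arithmetic. The sketch assumes binary units (1 PiB = 1024 x 1024 GiB), which is what reproduces the roughly 42000 figure; with decimal units the same division gives 40000. It counts a single copy of the data only.

```python
GIB_PER_PIB = 1024 * 1024  # binary units: 1 PiB = 1024 TiB = 1024 * 1024 GiB


def disks_needed(total_pib, disk_gib):
    """How many disks of disk_gib are needed to hold total_pib once."""
    return total_pib * GIB_PER_PIB / disk_gib


print(round(disks_needed(20, 500)))  # 41943, i.e. about 42000
print(20 * 1000 * 1000 // 500)       # 40000 with decimal units
```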
18:09 🔗 wp494_ (~wickedpla@[redacted]) has joined #internetarchive.bak
18:12 🔗 wp494 has quit (Ping timeout: 740 seconds)
18:13 🔗 wp494_ is now known as wp494
18:15 🔗 beardicus (~beardicus@[redacted]) has joined #internetarchive.bak
18:17 🔗 wp494 has quit (LOUD UNNECESSARY QUIT MESSAGES)
18:20 🔗 beardicus ahoy!
18:21 🔗 beardicus who shall receive my glorious public key?
18:21 🔗 wp494 (~wickedpla@[redacted]) has joined #internetarchive.bak
18:23 🔗 closure me
18:23 🔗 closure bestow it upon me
18:23 🔗 beardicus get ready!!!!!
18:23 🔗 beardicus ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDCGLT7D7/XP2ZhTF0sEBU1L1cOWXWeEAGNa/P4OvjZfaaA4Sew6V9jTvK8fL3bX
18:23 🔗 beardicus H/z9dcSHZ4mpLj6+7RvSo2F8jpxc9ggO/dVu6bxtnicsP6Yha6yeFwu6V7n5zDOQ57YSRzo8OR3tfzumH6Gg08nzdmSkmjMrOFk34
18:23 🔗 beardicus K32sXINKkmOL9ekJneBkIx7yQ1buFXXMQl57JsKD3QE1kGWM1oWidMsNF8Q3WJ1mmE6yS3Pa489WW4B8frgfYU4UeA0UmlPBCH0cR
18:23 🔗 beardicus f3M9jKkN4ET8Q5zfeI42J6ZvO6cr2INKwDKUA1GFOk8zDeWE4JAMv9GNVqShKbv3KsbqMZEo5bclX bert@laslo
18:23 🔗 beardicus doh.
18:24 🔗 beardicus can pastebin that if you prefer. newb mistake.
18:24 🔗 closure one line plz. /msg is ok
18:27 🔗 GitHub155/#internetarchive.bak IA.BAK/pubkey 0cb8d91 Joey Hess: I am beardicus!
18:27 🔗 GitHub155/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/vemK3
18:27 🔗 closure yer in
18:27 🔗 beardicus thanky hanky.
18:29 🔗 beardicus no dice closure. patience required or mis-paste?
18:31 🔗 closure seems I need to fix something on the server
18:31 🔗 closure key didn't deploy
18:31 🔗 closure try it now
18:32 🔗 beardicus success.
18:32 🔗 beardicus or wait...
18:32 🔗 closure hmm, all I did was hit the webhook. Perhaps github is being slow..
18:32 🔗 beardicus it's asking me for SHARD1's password.
18:32 🔗 beardicus so not so much.
18:34 🔗 beardicus closure: http://pastebin.com/Mh7yjGKV
18:35 🔗 closure so, it shouldn't be prompting for a password, ssh is being run with -o BatchMode=yes
18:36 🔗 closure and, it managed to connect to the server once with the key
18:36 🔗 closure interesting. Maybe we need to enable batchmode all the time, not just in the initial probe
18:37 🔗 closure muxserver_listen: link mux listener .git/annex/ssh/SHARD1@iabak.archiveteam.org.QK8zOCbtNebI7q54 => .git/annex/ssh/SHARD1@iabak.archiveteam.org: Operation not permitted
18:37 🔗 closure I have never seen ssh say that before
18:38 🔗 closure are you perhaps using this on a filesystem that does not support unix sockets?
18:38 🔗 beardicus heh. it's on an nfs share. :(
18:38 🔗 closure try this: cd shard1; git config annex.sshcaching false
18:38 🔗 closure and then re-run ./iabak
18:39 🔗 beardicus that got rid of the muxserver error, but the password prompt remains.
18:40 🔗 Kazzy beardicus: check the permissions of id_rsa in the .git/annex/ directory
18:40 🔗 closure yeah, that sounds wise
18:40 🔗 Kazzy i think it's possible that due ot being a nnfs share, permissions screw up, and ssh will (for some reason silently) refuse to use the key
18:40 🔗 Kazzy to*, nfs*
18:41 🔗 closure maybe try: ssh -i shard1/.git/annex/id_rsa -v SHARD1@iabak.archiveteam.org
18:41 🔗 closure and see why ssh is refusing to use that id_rsa
18:42 🔗 yipdw -o IdentitiesOnly yes might also be useful if this is a recent Ubuntu system and you have many SSH keys
18:42 🔗 beardicus id_rsa is rw user only. it is user "nobody" though. so that's probably it.
18:43 🔗 closure beardicus: also, I'm curious about the fifo problem. git-annex probes to see if the filesystem supports them. but its probe seems to have missed this case
18:44 🔗 closure I wonder if you could try, on the nfs share: mkfifo foo, and then if that succeeds, see if you can ln foo bar
18:44 🔗 closure beardicus: are you running this as nobody?
18:44 🔗 beardicus i am not.
18:45 🔗 closure so, nfs ate owner?!
18:45 🔗 closure remembers there was a reason he stopped using nfs in 1996
18:45 🔗 beardicus yeah. nfs is a black hole for me. can't think of a better way to permanently mount my nas on my linux box.
18:46 🔗 closure well, you can run git-annex and iabak on some nas's :)
18:46 🔗 closure but I would like to investigate the nfs problems a bit more
18:46 🔗 beardicus yeah. thought about it. i think it's x86-based.
18:47 🔗 closure it creates IA.BAK/id_rsa and then copies it to IA.BAK/shard1/.git/annex/id_rsa .. is only the copy owned by the wrong user
18:47 🔗 closure ?
18:49 🔗 beardicus everything on the mount ends up as nobody/nogroup
18:49 🔗 beardicus could not hard link that fifo closure
18:50 🔗 closure hmm, it was able to use the id_rsa once
18:50 🔗 closure so you were able to create the fifo?
18:50 🔗 beardicus yes.
18:50 🔗 closure cool.. I guessed right about the problem with the fifo!
18:50 🔗 closure will fix that
18:53 🔗 beardicus guess i need to figure out all these nfs permissions baloney before i proceed. Going to try running it on the nas.
18:54 🔗 closure actually, you can hack around it
18:54 🔗 closure just do this: copy id_rsa from the nfs to your home directory someplace
18:54 🔗 closure and then, edit IA.BAK/shard1/.git/config
18:55 🔗 closure there's a line that has the path to the id_rsa file, just update that
18:55 🔗 closure we could even automate this
18:55 🔗 closure check if the file is owned by whoami and if not, put it in $HOME/
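The automation suggested above (check whether the key file is owned by the current user and, if not, put a copy under $HOME) might look something like this. It is a sketch, not what the iabak scripts do: the `usable_key_path` helper and the `iabak_id_rsa` file name are hypothetical, and updating the path in `.git/config` is left out.

```python
import os


def usable_key_path(key_path, home):
    """Return a path to the key that is owned by the current user.

    If key_path already belongs to us, use it as is. Otherwise (for
    example on an NFS mount that maps every file to "nobody") copy it
    into `home` with owner-only permissions, as ssh requires.
    """
    if os.stat(key_path).st_uid == os.getuid():
        return key_path
    fallback = os.path.join(home, "iabak_id_rsa")  # hypothetical file name
    # Create the copy with mode 0600 from the start, so the key is
    # never readable by other users, not even briefly.
    fd = os.open(fallback, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as dst, open(key_path, "rb") as src:
        dst.write(src.read())
    os.chmod(fallback, 0o600)  # in case the file already existed
    return fallback
```

The returned path would then be the one to put after `-i` in `annex-ssh-options`.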
18:58 🔗 beardicus no dice.
18:58 🔗 beardicus copied id_rsa to ~/.ssh/at_id_rsa
18:58 🔗 beardicus permissions are rw only for user, owner is current user
18:59 🔗 beardicus edited shard1/.git/config
18:59 🔗 beardicus "annex-ssh-options = -i ~/.ssh/at_id_rsa"
18:59 🔗 closure not sure if ~ works in there
19:00 🔗 beardicus tried it with a full path. same password prompt.
19:02 🔗 closure is group still nobody?
19:02 🔗 beardicus nope.
19:06 🔗 beardicus poking around on the nas... busybox `find` command chokes on the -printf option.
19:06 🔗 beardicus this is used in sharddirs="$(find . -maxdepth 1 -name shard\* -type d -printf "%P\n")"
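For illustration, the selection that the `find` call above performs (directories directly under the current directory whose names start with `shard`) can be expressed without `-printf`. This Python sketch is not a drop-in replacement for the shell line in the script, and the `shard_dirs` name is made up; it only shows what the command is meant to return.

```python
import os


def shard_dirs(base="."):
    """List names of directories directly under base that start with 'shard'.

    Mirrors: find . -maxdepth 1 -name 'shard*' -type d -printf "%P\n"
    (symlinks are not followed, as with find's default behaviour).
    """
    return sorted(
        entry.name
        for entry in os.scandir(base)
        if entry.name.startswith("shard") and entry.is_dir(follow_symlinks=False)
    )
```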
19:07 🔗 closure I think you'll have better luck debugging ssh
19:08 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
19:08 🔗 svchfoo2 gives channel operator status to db48x
19:10 🔗 SketchCow Shift-reload on the graph page shows the blocks getting smaller and the 4 block getting bigger.
19:13 🔗 db48x wonders why his computer was off
19:15 🔗 Kazzy beardicus: does .git/config have references to crippled mode?
19:16 🔗 beardicus Kazzy no.
19:21 🔗 londoncal has quit (Quit: Leaving...)
19:28 🔗 matthusb- (~matthusby@[redacted]) has joined #internetarchive.bak
19:33 🔗 beardicus working ok on a synology nas after working around that printf issue.
19:33 🔗 Nemo_bis (~Nemo_bis@[redacted]) has joined #internetarchive.bak
19:33 🔗 beardicus though i still had to fix permissions on the id_rsa file.
19:36 🔗 kniffy (~kniffy@[redacted]) has joined #internetarchive.bak
19:36 🔗 Nemo_bis (~Nemo_bis@[redacted]) has left #internetarchive.bak
19:48 🔗 SN4T14__ (~SN4T14@[redacted]) has joined #internetarchive.bak
19:56 🔗 SN4T14_ has quit (Ping timeout: 512 seconds)
20:13 🔗 closure beardicus: awesome!
20:13 🔗 closure beardicus: hey, you could open a pull request with the printf workaround
20:13 🔗 beardicus not sure it went all the way.
20:13 🔗 beardicus is shard1 only about 600mb?
20:13 🔗 closure no, it's much bigger
20:14 🔗 beardicus yeah. i got a message that said it's fully backed up, and then exit to shell.
20:14 🔗 closure I think we're not done with SHARD1, so probably some problem there
20:15 🔗 beardicus yeah. must've just downloaded the index.
20:17 🔗 beardicus if i run iabak again, i get a bunch of 403s.
20:18 🔗 closure well, the 403's are known
20:18 🔗 beardicus ok.
20:18 🔗 closure but, your system seems to not be noticing the files that remain to get
20:18 🔗 beardicus i'll leave it going.
20:19 🔗 beardicus i'll see if it starts downloading some stuff.
20:19 🔗 closure do you have a /usr/bin/shuf?
20:19 🔗 beardicus i don't, not on the nas.
20:19 🔗 beardicus i wonder if busybox can do that and they just didn't link it.
20:20 🔗 closure you don't need it, I was just wondering which branch it was running
20:20 🔗 beardicus no shuf in busybox i guess.
20:21 🔗 closure it sounds like it tried to download only the files that no other node has downloaded yet
20:21 🔗 closure where instead, it's supposed to download files that are not on at least 3 other nodes
20:21 🔗 closure try this: cd shard1 ; git annex find --not --copies 4
20:21 🔗 beardicus well it seemed to think all files were on four nodes.
20:21 🔗 closure see if it finds any files
20:22 🔗 closure seems so, but I don't think we're there yet are we?
20:23 🔗 db48x git annex find --copies 2 --not --copies 3 | wc -l gives me 40383
20:23 🔗 beardicus this could be it "git: 'annex' is not a git command. See 'git --help'."
20:24 🔗 closure beardicus: oh, I forgot to say, run IA.BAK/git-annex.linux/runshell first
20:26 🔗 beardicus closure that command is finding some files.
20:26 🔗 closure hmm, weird
20:26 🔗 closure try git annex get --not --copies 4
20:26 🔗 closure that's what the script is supposed to run, in your case (with no shuf)
20:27 🔗 beardicus yeah. that looks like the output i get when i rerun iabak now.
20:27 🔗 beardicus so i should probably just let it run through all the 403s until it gets to the undone stuff.
20:28 🔗 beardicus doesn't really explain why it gave up in the first place though.
20:28 🔗 closure oh, I thought you said it got through them all and said it was done
20:28 🔗 bzc6p has quit (Read error: Operation timed out)
20:28 🔗 beardicus no, it downloaded what i assume is an index of some sort, and then claimed it was all done and exited back to shell.
20:29 🔗 closure hmm
20:30 🔗 closure what kind of claim it was done?
20:30 🔗 bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
20:30 🔗 closure this is where real output pastes are useful
20:30 🔗 closure "Wow! This shard of the IA is fully backed up now!"
20:30 🔗 beardicus http://pastebin.com/dU9FQrrv
20:32 🔗 closure yeah, looks like the git annex get step somehow didn't run at all
20:32 🔗 closure I don't understand how
20:32 🔗 closure if it's plowing through 403's now, that step is certainly running this time
20:33 🔗 db48x heh, after a sync I see only 8k left
20:34 🔗 beardicus not sure. gotta run to make dindin now. i'll try to do that pull request this evening closure.
20:35 🔗 closure there's a lot of better homes and gardens and ladies home journal in there that is all 403
20:37 🔗 SketchCow usfederalcourts, man
20:37 🔗 SketchCow Should have gone with that
20:38 🔗 sep332 $git-annex fsck --distributed
20:38 🔗 sep332 git-annex: unrecognized option `--distributed'
20:38 🔗 closure sep332: it's in the daily builds, next release
20:39 🔗 sep332 ok. would it be useful for me to get it and do that before moving this drive?
20:39 🔗 closure no, you don't need to run it
20:39 🔗 closure I mean, you can run it wherever the drive ends up, and we'll know the drive still has the files
20:40 🔗 sep332 ok thanks
20:40 🔗 closure or you can not run it, and we'll realize you are not keeping the files
20:40 🔗 db48x which is one of the scenarios we need to test
20:40 🔗 sep332 yeah i meant for testing
20:54 🔗 sankin1 has quit (Leaving.)
21:01 🔗 db48x uh oh, getting really long wait times on disk operations
22:14 🔗 closure oh boy, I can make this distributed fsck a lot more efficient
22:14 🔗 closure I think we could run fscks daily, or hourly if we wanted to, if I pull this off
22:18 🔗 closure I think it will be around 128 bytes per fsck per client
22:26 🔗 db48x closure: just read about your changes to git annex get
22:26 🔗 db48x as I understand it, "git annex get dir1 dir2" needs to download dir1/file1, dir1/file2, dir1/file3, dir2/subdir1/file1, dir2/subdir1/file2, dir2/subdir2/file1 etc, and you want them to be in the same order as the directories were specified on the command line?
22:27 🔗 closure yep
22:27 🔗 closure SketchCow: I broke the graph. Bonus: No numcopies +0 left!
22:27 🔗 db48x why even generate the list of files ahead of time? why not produce a lazy sequence of them instead?
22:28 🔗 db48x you can then produce them in sorted order, and amortize the cost over the whole time you're downloading things
22:28 🔗 closure git ls-files b c a will output a b c
22:30 🔗 closure I'd love to stream the list lazily, but I cannot when it's not generated in the right order
22:32 🔗 db48x hmm
22:34 🔗 db48x still seems like you'd be able to call ls-files on each dir individually, whenever the list runs out
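The lazy, per-directory listing suggested here can be sketched with a Python generator. This is an assumption-laden illustration: it walks the filesystem with `os.walk` as a stand-in for calling `git ls-files` on each directory, so the ordering within a directory is this sketch's choice, not git's.

```python
import os


def files_in_order(dirs):
    """Lazily yield files, keeping the order the directories were given in.

    Each directory is only listed when the previous one runs out, so the
    cost of listing is spread over the whole run instead of paid up front.
    """
    for top in dirs:
        for root, subdirs, files in os.walk(top):
            subdirs.sort()  # visit subdirectories in a stable order
            for name in sorted(files):
                yield os.path.join(root, name)
```

A consumer can start downloading the first file of the first directory before anything in the later directories has been listed.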
22:36 🔗 pikhq closure: Did you remove the numcopies 0 stuff, or did it just happen to finally get all retrieved?
22:58 🔗 kyan (~kyan@[redacted]) has joined #internetarchive.bak
23:28 🔗 db48x pikhq: git log says he removed the unavailable ones
23:32 🔗 kniffy (~kniffy@[redacted]) has left #internetarchive.bak (Leaving)
