[01:29] *** primus104 has quit IRC (Leaving.)
[02:49] *** mntasauri has quit IRC (Max SendQ exceeded)
[02:49] *** mntasauri has joined #internetarchive.bak
[06:50] *** primus104 has joined #internetarchive.bak
[07:48] *** primus104 has quit IRC (Leaving.)
[08:06] *** garyrh has quit IRC (Ping timeout: 506 seconds)
[08:21] *** primus104 has joined #internetarchive.bak
[12:14] *** primus104 has quit IRC (Leaving.)
[14:11] *** primus104 has joined #internetarchive.bak
[14:40] <tpw_rules> my registration broke cause my ssh key changed
[14:40] <tpw_rules> i tried to run ./change-email but wget said something about bad inputs
[14:40] <tpw_rules> oh
[14:42] <iabak-reg> 03registrar 05master 489b070 06other 10SHARD1/pubkeys registration of twatson52 on SHARD1
[14:44] <iabak-reg> 03registrar 05master 10cecdd 06other 10SHARD5/pubkeys registration of twatson52 on SHARD5
[14:46] <iabak-reg> 03registrar 05master 2b8ed8d 06other 10SHARD6/pubkeys registration of twatson52 on SHARD6
[14:53] *** primus104 has quit IRC (Leaving.)
[16:54] <db48x> tpw_rules: why did your ssh key change?
[16:54] <db48x> tpw_rules: also, can you pastebin the wget error?
[16:55] <tpw_rules> i reinstalled the OS and didn't keep the old one. i just forgot to generate a new public key so it couldn't upload it to the server -> replied with bad inputs
[16:55] <tpw_rules> it's all good mow
[16:55] <tpw_rules> now*
[16:57] <tpw_rules> db48x: ^
[17:22] <iabak-reg> 03registrar 05master a60e1de 06other 10SHARD4/pubkeys registration of aliz on SHARD4
[17:23] <db48x> tpw_rules: if you kept your backup then it should have kept your key as well
[18:21] *** primus104 has joined #internetarchive.bak
[18:23] <tpw_rules> is there a way to make starting up iabak take less than several minutes
[18:24] <tpw_rules> i suspect shuf(1) is reading the directory tree for some reason
[18:24] <tpw_rules> and it does a whole lot of syncing and committing
[18:26] <db48x> it has to sync each shard before it knows that it has up-to-date information about the number of copies of each file that exist
[18:27] <tpw_rules> ah, fair enough
[18:27] <db48x> shuf doesn't read any files, it just takes a stream of text lines and randomizes their order
[18:27] <db48x> we feed shuf with a list of filenames though, and we have to generate those filenames somehow :)
[18:28] <tpw_rules> but how does it take five minutes
[18:30] <tpw_rules> what is it bound by?
[18:45] *** primus104 has quit IRC (Leaving.)
[19:30] <db48x> IO
[19:30] <db48x> shuffling is an inherently sequential operation; you must have the entire input before you can begin shuffling it
[19:31] <db48x> so git annex find has to crawl all over the repository looking for files that don't have enough copies, feeding the filenames to shuf
[19:36] *** VADemon has joined #internetarchive.bak
[19:51] <tpw_rules> disk or network? it just seems silly that there's no cache anywhere
[20:19] *** primus104 has joined #internetarchive.bak
[20:20] <db48x> disk io
[20:20] <db48x> we could maintain a separate cache, certainly
[20:20] <db48x> we could have a file in the root of the shard which contains a list of the filenames, for instance
[20:42] <tpw_rules> will the script ever work on multiple shards at once?
[20:59] <db48x> dunno
[20:59] <db48x> in principle it could
[21:03] <tpw_rules> that was meant as right now. so it won't
[21:03] <tpw_rules> okay
[21:11] <db48x> you can run multiple copies at the same time though
[21:12] <tpw_rules> yeah, but they don't seem to work on multiple shards
[21:12] <db48x> git annex and iabak are both quite careful not to step on each other
[21:12] <db48x> no, each copy will want to finish your current shards first; then they'll pick a random new shard to work on
[21:13] <db48x> of course, since we're still testing the software there are only two new shards to pick from at the moment :)
[21:13] <tpw_rules> iabak often dies if i start more than a couple at once in the syncing stage because some can't lock
[21:13] <tpw_rules> oh ok. i'm running shards 1, 5, and 6. 1 is done and it's working on 5
[21:13] <tpw_rules> i'm experimenting with aufs
[21:14] <db48x> hmm
[21:14] <db48x> it should do more to avoid a failure on that initial sync...
[21:25] <tpw_rules> http://pastie.org/private/pfunxztppb7yw1xognlnqa
[21:25] <db48x> yep
[21:27] <db48x> I'm testing a fix by using a shard that has an extra remote; that extra remote points to a machine which is offline
[23:47] <db48x> tpw_rules: give that a go and see how it works for you
[23:47] <tpw_rules> 'git pull' in the directory and see what happens?
[23:48] <db48x> in the IA.BAK directory, then restart iabak
[23:49] <tpw_rules> do the fsck and sync and such cronjobs leave a trail somewhere to let me know that i set them right
[23:49] <db48x> yes
[23:50] <tpw_rules> it says i have the cronjob installed but i'm not sure if the hourly sync is happening
[23:50] <db48x> if your machine uses systemd, it'll put log entries in the journal; journalctl --user-unit=iabak-cronjob will show them to you
[23:50] <db48x> if you use cron, then the same info goes to the .log file
[23:50] <tpw_rules> is the hourly sync a cronjob or part of iabak?
[23:50] <db48x> iabak-cronjob.log
[23:51] <db48x> part of iabak
[23:51] <db48x> the cronjob is daily
[23:51] <tpw_rules> ok. so it should happen if iabak is running
[23:51] <tpw_rules> i'll see if it happened tomorrow
[23:51] <tpw_rules> how exactly does fscking work? does it only run if all the current shards are finished?
[23:52] <db48x> no, it runs at the appointed time every morning regardless
[23:52] <tpw_rules> syncing should happen*
[23:52] <db48x> all git annex commands cooperate well with each other, so the fsck will check those files which you've downloaded
[23:52] <tpw_rules> it said something about the cronjob does not do checksums on the wiki
[23:53] <tpw_rules> also your patch works
[23:53] <db48x> you can make sure the hourly sync is running by checking for the running process called iabak-hourlysync
[23:53] <db48x> ps -AH will show you
[23:53] <tpw_rules> looks like it. cool
[23:54] <tpw_rules> i still wish it were faster to start up. the approach of just plugging unix utilities into each other seems to be slowing things down a bit, but it does work
[23:54] <tpw_rules> (and that's probably a personal opinion more than anything)
[23:55] <db48x> it'd be just as slow regardless of the implementation language, since you have to enumerate all the files in the repository, checking how many copies there are of each, then shuffle the list
[23:55] <db48x> shuffling gets faster the closer the shard is to having four copies of everything, but the disk io is still the larger cost
[23:55] <db48x> however, closure has already designed a replacement
[23:56] <db48x> instead of iabak asking git annex to find files with too few copies, we'll be able to set a prefered content expression on the repositories that makes it want N of M repositories have each file
[23:58] <db48x> it'll assign each file to a repository as a constant-time operation by taking the hash key of the file mod the number of known repositories to pick which repository should download it
[23:58] <tpw_rules> i think git annex not having a list of files on disk is a bit silly
[23:59] <db48x> it does have a list of files on disk
[23:59] <tpw_rules> to be fair i probably don't know anything