#internetarchive.bak 2015-05-16,Sat


Time Nickname Message
01:29 🔗 primus104 has quit IRC (Leaving.)
02:49 🔗 mntasauri has quit IRC (Max SendQ exceeded)
02:49 🔗 mntasauri has joined #internetarchive.bak
06:50 🔗 primus104 has joined #internetarchive.bak
07:48 🔗 primus104 has quit IRC (Leaving.)
08:06 🔗 garyrh has quit IRC (Ping timeout: 506 seconds)
08:21 🔗 primus104 has joined #internetarchive.bak
12:14 🔗 primus104 has quit IRC (Leaving.)
14:11 🔗 primus104 has joined #internetarchive.bak
14:40 🔗 tpw_rules my registration broke cause my ssh key changed
14:40 🔗 tpw_rules i tried to run ./change-email but wget said something about bad inputs
14:40 🔗 tpw_rules oh
14:42 🔗 iabak-reg registrar master 489b070 other SHARD1/pubkeys registration of twatson52 on SHARD1
14:44 🔗 iabak-reg registrar master 10cecdd other SHARD5/pubkeys registration of twatson52 on SHARD5
14:46 🔗 iabak-reg registrar master 2b8ed8d other SHARD6/pubkeys registration of twatson52 on SHARD6
14:53 🔗 primus104 has quit IRC (Leaving.)
16:54 🔗 db48x tpw_rules: why did your ssh key change?
16:54 🔗 db48x tpw_rules: also, can you pastebin the wget error?
16:55 🔗 tpw_rules i reinstalled the OS and didn't keep the old one. i just forgot to generate a new public key so it couldn't upload it to the server -> replied with bad inputs
16:55 🔗 tpw_rules it's all good mow
16:55 🔗 tpw_rules now*
16:57 🔗 tpw_rules db48x: ^
17:22 🔗 iabak-reg registrar master a60e1de other SHARD4/pubkeys registration of aliz on SHARD4
17:23 🔗 db48x tpw_rules: if you kept your backup then it should have kept your key as well
18:21 🔗 primus104 has joined #internetarchive.bak
18:23 🔗 tpw_rules is there a way to make starting up iabak take less than several minutes
18:24 🔗 tpw_rules i suspect shuf(1) is reading the directory tree for some reason
18:24 🔗 tpw_rules and it does a whole lot of syncing and committing
18:26 🔗 db48x it has to sync each shard before it knows that it has up-to-date information about the number of copies of each file that exist
18:27 🔗 tpw_rules ah, fair enough
18:27 🔗 db48x shuf doesn't read any files, it just takes a stream of text lines and randomizes their order
18:27 🔗 db48x we feed shuf with a list of filenames though, and we have to generate those filenames somehow :)
18:28 🔗 tpw_rules but how does it take five minutes
18:30 🔗 tpw_rules what is it bound by?
18:45 🔗 primus104 has quit IRC (Leaving.)
19:30 🔗 db48x IO
19:30 🔗 db48x shuffling is an inherently sequential operation; you must have the entire input before you can begin shuffling it
19:31 🔗 db48x so git annex find has to crawl all over the repository looking for files that don't have enough copies, feeding the filenames to shuf
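
A minimal sketch of the point db48x makes here: a shuffle cannot emit anything until it has read its whole input, so the slow part is producing the list, not randomizing it. The function name `shuffled_candidates` and the sample filenames are hypothetical; in iabak the list comes from `git annex find` piped into shuf, not from Python.

```python
import random
from typing import Iterable, List, Optional


def shuffled_candidates(lines: Iterable[str], seed: Optional[int] = None) -> List[str]:
    """Return the input lines in random order, roughly as shuf(1) would.

    The whole input is collected before shuffling starts, so no output
    is possible until the producer of `lines` has finished.
    """
    # Drain the producer completely; this is where the waiting happens.
    items = [line.rstrip("\n") for line in lines]
    rng = random.Random(seed)  # seed is only for repeatable demos
    rng.shuffle(items)         # in-place shuffle of the full list
    return items


# Hypothetical filenames standing in for the output of `git annex find`.
sample = ["item-a/file1\n", "item-b/file2\n", "item-c/file3\n"]
for name in shuffled_candidates(sample, seed=1):
    print(name)
```

Passing a generator instead of a list shows the same behaviour: the loop prints nothing until the generator is exhausted.
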
19:36 🔗 VADemon has joined #internetarchive.bak
19:51 🔗 tpw_rules disk or network? it just seems silly that there's no cache anywhere
20:19 🔗 primus104 has joined #internetarchive.bak
20:20 🔗 db48x disk io
20:20 🔗 db48x we could maintain a separate cache, certainly
20:20 🔗 db48x we could have a file in the root of the shard which contains a list of the filenames, for instance
20:42 🔗 tpw_rules will the script ever work on multiple shards at once?
20:59 🔗 db48x dunno
20:59 🔗 db48x in principle it could
21:03 🔗 tpw_rules that was meant as right now. so it won't
21:03 🔗 tpw_rules okay
21:11 🔗 db48x you can run multiple copies at the same time though
21:12 🔗 tpw_rules yeah, but they don't seem to work on multiple shards
21:12 🔗 db48x git annex and iabak are both quite careful not to step on each other
21:12 🔗 db48x no, each copy will want to finish your current shards first; then they'll pick a random new shard to work on
21:13 🔗 db48x of course, since we're still testing the software there are only two new shards to pick from at the moment :)
21:13 🔗 tpw_rules iabak often dies if i start more than a couple at once in the syncing stage because some can't lock
21:13 🔗 tpw_rules oh ok. i'm running shards 1, 5, and 6. 1 is done and it's working on 5
21:13 🔗 tpw_rules i'm experimenting with aufs
21:14 🔗 db48x hmm
21:14 🔗 db48x it should do more to avoid a failure on that initial sync...
21:25 🔗 tpw_rules http://pastie.org/private/pfunxztppb7yw1xognlnqa
21:25 🔗 db48x yep
21:27 🔗 db48x I'm testing a fix by using a shard that has an extra remote; that extra remote points to a machine which is offline
23:47 🔗 db48x tpw_rules: give that a go and see how it works for you
23:47 🔗 tpw_rules 'git pull' in the directory and see what happens?
23:48 🔗 db48x in the IA.BAK directory, then restart iabak
23:49 🔗 tpw_rules do the fsck and sync and such cronjobs leave a trail somewhere to let me know that i set them right
23:49 🔗 db48x yes
23:50 🔗 tpw_rules it says i have the cronjob installed but i'm not sure if the hourly sync is happening
23:50 🔗 db48x if your machine uses systemd, it'll put log entries in the journal; journalctl --user-unit=iabak-cronjob will show them to you
23:50 🔗 db48x if you use cron, then the same info goes to the .log file
23:50 🔗 tpw_rules is the hourly sync a cronjob or part of iabak?
23:50 🔗 db48x iabak-cronjob.log
23:51 🔗 db48x part of iabak
23:51 🔗 db48x the cronjob is daily
23:51 🔗 tpw_rules ok. so it should happen if iabak is running
23:51 🔗 tpw_rules i'll see if it happened tomorrow
23:51 🔗 tpw_rules how exactly does fscking work? does it only run if all the current shards are finished?
23:52 🔗 db48x no, it runs at the appointed time every morning regardless
23:52 🔗 tpw_rules syncing should happen*
23:52 🔗 db48x all git annex commands cooperate well with each other, so the fsck will check those files which you've downloaded
23:52 🔗 tpw_rules it said something about the cronjob does not do checksums on the wiki
23:53 🔗 tpw_rules also your patch works
23:53 🔗 db48x you can make sure the hourly sync is running by checking for the running process called iabak-hourlysync
23:53 🔗 db48x ps -AH will show you
23:53 🔗 tpw_rules looks like it. cool
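
The check db48x suggests can be scripted. This is a sketch only: it assumes a POSIX `ps` that accepts `-A -o args=`, and it matches on the full command line because process names may be truncated in other `ps` columns. The helper names `mentions_process` and `hourly_sync_running` are made up for illustration.

```python
import subprocess


def mentions_process(ps_output: str, name: str) -> bool:
    """Return True if any line of the given ps output contains `name`."""
    return any(name in line for line in ps_output.splitlines())


def hourly_sync_running(name: str = "iabak-hourlysync") -> bool:
    """Run ps and look for the hourly sync helper among all command lines."""
    # -A selects every process; -o args= prints command lines with no header.
    result = subprocess.run(
        ["ps", "-A", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return mentions_process(result.stdout, name)
```

Calling `hourly_sync_running()` while iabak is running should return True if the helper is alive; a False result only means no command line contained that name.
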
23:54 🔗 tpw_rules i still wish it were faster to start up. the approach of just plugging unix utilities into each other seems to be slowing things down a bit, but it does work
23:54 🔗 tpw_rules (and that's probably a personal opinion more than anything)
23:55 🔗 db48x it'd be just as slow regardless of the implementation language, since you have to enumerate all the files in the repository, checking how many copies there are of each, then shuffle the list
23:55 🔗 db48x shuffling gets faster the closer the shard is to having four copies of everything, but the disk io is still the larger cost
23:55 🔗 db48x however, closure has already designed a replacement
23:56 🔗 db48x instead of iabak asking git annex to find files with too few copies, we'll be able to set a preferred content expression on the repositories that makes it want N of M repositories to have each file
23:58 🔗 db48x it'll assign each file to a repository as a constant-time operation by taking the hash key of the file mod the number of known repositories to pick which repository should download it
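
A rough sketch of the assignment described above, so the "hash mod number of repositories" idea is concrete. Everything specific here is an assumption for illustration: the use of SHA-256, the sorted repository list, the "consecutive slots" way of picking N of M, and the names `assign_repository` and `wanted_by`. The real design in git-annex may differ.

```python
import hashlib
from typing import List, Sequence


def _slot(key: str, count: int) -> int:
    """Map a file key to a slot in [0, count) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % count


def assign_repository(key: str, repositories: Sequence[str]) -> str:
    """Pick one repository for a key: hash(key) mod number of repositories."""
    if not repositories:
        raise ValueError("at least one repository is required")
    # Sorted so every client agrees on the order; sorting on each call
    # is for simplicity only, a real implementation would sort once.
    ordered = sorted(repositories)
    return ordered[_slot(key, len(ordered))]


def wanted_by(key: str, repositories: Sequence[str], copies: int) -> List[str]:
    """Pick `copies` distinct repositories (N of M) from consecutive slots."""
    ordered = sorted(repositories)
    if not 0 < copies <= len(ordered):
        raise ValueError("copies must be between 1 and the number of repositories")
    start = _slot(key, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(copies)]
```

With this scheme no client has to enumerate the shard to decide what to fetch: each one can evaluate `wanted_by(key, repos, 4)` per file and check whether its own name is in the result.
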
23:58 🔗 tpw_rules i think git annex not having a list of files on disk is a bit silly
23:59 🔗 db48x it does have a list of files on disk
23:59 🔗 tpw_rules to be fair i probably don't know anything
