#internetarchive.bak 2015-05-16,Sat

Time	Nickname	Message
01:29 ^🔗		primus104 has quit IRC (Leaving.)
02:49 ^🔗		mntasauri has quit IRC (Max SendQ exceeded)
02:49 ^🔗		mntasauri has joined #internetarchive.bak
06:50 ^🔗		primus104 has joined #internetarchive.bak
07:48 ^🔗		primus104 has quit IRC (Leaving.)
08:06 ^🔗		garyrh has quit IRC (Ping timeout: 506 seconds)
08:21 ^🔗		primus104 has joined #internetarchive.bak
12:14 ^🔗		primus104 has quit IRC (Leaving.)
14:11 ^🔗		primus104 has joined #internetarchive.bak
14:40 ^🔗	tpw_rules	my registration broke cause my ssh key changed
14:40 ^🔗	tpw_rules	i tried to run ./change-email but wget said something about bad inputs
14:40 ^🔗	tpw_rules	oh
14:42 ^🔗	iabak-reg	registrar master 489b070 other SHARD1/pubkeys registration of twatson52 on SHARD1
14:44 ^🔗	iabak-reg	registrar master 10cecdd other SHARD5/pubkeys registration of twatson52 on SHARD5
14:46 ^🔗	iabak-reg	registrar master 2b8ed8d other SHARD6/pubkeys registration of twatson52 on SHARD6
14:53 ^🔗		primus104 has quit IRC (Leaving.)
16:54 ^🔗	db48x	tpw_rules: why did your ssh key change?
16:54 ^🔗	db48x	tpw_rules: also, can you pastebin the wget error?
16:55 ^🔗	tpw_rules	i reinstalled the OS and didn't keep the old one. i just forgot to generate a new public key so it couldn't upload it to the server -> replied with bad inputs
16:55 ^🔗	tpw_rules	it's all good now
16:57 ^🔗	tpw_rules	db48x: ^
17:22 ^🔗	iabak-reg	registrar master a60e1de other SHARD4/pubkeys registration of aliz on SHARD4
17:23 ^🔗	db48x	tpw_rules: if you kept your backup then it should have kept your key as well
18:21 ^🔗		primus104 has joined #internetarchive.bak
18:23 ^🔗	tpw_rules	is there a way to make starting up iabak take less than several minutes
18:24 ^🔗	tpw_rules	i suspect shuf(1) is reading the directory tree for some reason
18:24 ^🔗	tpw_rules	and it does a whole lot of syncing and committing
18:26 ^🔗	db48x	it has to sync each shard before it knows that it has up-to-date information about the number of copies of each file that exist
18:27 ^🔗	tpw_rules	ah, fair enough
18:27 ^🔗	db48x	shuf doesn't read any files, it just takes a stream of text lines and randomizes their order
18:27 ^🔗	db48x	we feed shuf with a list of filenames though, and we have to generate those filenames somehow :)
18:28 ^🔗	tpw_rules	but how does it take five minutes
18:30 ^🔗	tpw_rules	what is it bound by?
18:45 ^🔗		primus104 has quit IRC (Leaving.)
19:30 ^🔗	db48x	IO
19:30 ^🔗	db48x	shuffling is an inherently sequential operation; you must have the entire input before you can begin shuffling it
19:31 ^🔗	db48x	so git annex find has to crawl all over the repository looking for files that don't have enough copies, feeding the filenames to shuf
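
The find-then-shuffle step db48x describes can be sketched in Python. This is only an illustration under stated assumptions, not iabak's actual code: the real script pipes `git annex find` into `shuf`, whereas here the `copies` mapping, the `wanted` threshold, and both function names are hypothetical stand-ins. The point it shows is that the filter can stream results one at a time, but the shuffle cannot start until it has buffered every line.

```python
import random
from typing import Dict, Iterable, Iterator, List, Optional


def find_underreplicated(copies: Dict[str, int], wanted: int) -> Iterator[str]:
    """Yield filenames that have fewer than `wanted` known copies.

    Hypothetical stand-in for the enumeration step; results stream out
    one at a time as the mapping is walked.
    """
    for name, count in copies.items():
        if count < wanted:
            yield name


def shuffle_lines(lines: Iterable[str], seed: Optional[int] = None) -> List[str]:
    """Return the input lines in random order, in the spirit of shuf(1)."""
    buffered = list(lines)  # must consume the whole input before shuffling
    random.Random(seed).shuffle(buffered)
    return buffered


if __name__ == "__main__":
    # Made-up filenames and copy counts, for illustration only.
    copies = {"a.warc": 1, "b.warc": 4, "c.warc": 2, "d.warc": 0}
    print(shuffle_lines(find_underreplicated(copies, wanted=4)))
```

Because `shuffle_lines` has to call `list()` on its input first, the total time is bounded below by how long the producer takes to walk the repository, which matches the disk IO cost discussed in the log.
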
19:36 ^🔗		VADemon has joined #internetarchive.bak
19:51 ^🔗	tpw_rules	disk or network? it just seems silly that there's no cache anywhere
20:19 ^🔗		primus104 has joined #internetarchive.bak
20:20 ^🔗	db48x	disk io
20:20 ^🔗	db48x	we could maintain a separate cache, certainly
20:20 ^🔗	db48x	we could have a file in the root of the shard which contains a list of the filenames, for instance
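
A minimal sketch of the cache idea floated here, assuming one filename per line in a file at the shard root. The cache filename, the function name, and the `enumerate_files` callback are all hypothetical; nothing in the log says iabak implements this, and the sketch deliberately leaves out invalidation, which a real cache would need because copy counts change over time. It also assumes filenames contain no newlines.

```python
import os
from typing import Callable, Iterable, List

CACHE_NAME = ".filelist-cache"  # hypothetical filename, not part of iabak


def load_or_build_filelist(
    shard_root: str, enumerate_files: Callable[[], Iterable[str]]
) -> List[str]:
    """Return the cached filename list, building the cache on first use."""
    cache_path = os.path.join(shard_root, CACHE_NAME)
    if os.path.exists(cache_path):
        # Fast path: one sequential read instead of a crawl of the repository.
        with open(cache_path, encoding="utf-8") as fh:
            return [line.rstrip("\n") for line in fh]

    # Slow path: run the expensive enumeration once and remember the result.
    names = list(enumerate_files())
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        for name in names:
            fh.write(name + "\n")
    os.replace(tmp_path, cache_path)  # rename so readers never see a partial file
    return names
```
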
20:42 ^🔗	tpw_rules	will the script ever work on multiple shards at once?
20:59 ^🔗	db48x	dunno
20:59 ^🔗	db48x	in principle it could
21:03 ^🔗	tpw_rules	that was meant as right now. so it won't
21:03 ^🔗	tpw_rules	okay
21:11 ^🔗	db48x	you can run multiple copies at the same time though
21:12 ^🔗	tpw_rules	yeah, but they don't seem to work on multiple shards
21:12 ^🔗	db48x	git annex and iabak are both quite careful not to step on each other
21:12 ^🔗	db48x	no, each copy will want to finish your current shards first; then they'll pick a random new shard to work on
21:13 ^🔗	db48x	of course, since we're still testing the software there are only two new shards to pick from at the moment :)
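
The selection rule described above (finish current shards first, then pick a random new one) might look roughly like this. The function name and arguments are hypothetical, and picking the first unfinished shard is an assumption; the log does not say which of several unfinished shards a copy prefers.

```python
import random
from typing import Optional, Sequence


def choose_shard(
    unfinished: Sequence[str],
    new_shards: Sequence[str],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick the next shard to work on, or None if there is nothing to do."""
    if unfinished:
        # Assumption: take the first unfinished shard already checked out.
        return unfinished[0]
    if new_shards:
        # Only when everything local is done, pick a new shard at random.
        return (rng or random).choice(list(new_shards))
    return None
```

Under this rule, several copies started at once would all converge on the same unfinished shard, which is consistent with what tpw_rules observes.
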
21:13 ^🔗	tpw_rules	iabak often dies if i start more than a couple at once in the syncing stage because some can't lock
21:13 ^🔗	tpw_rules	oh ok. i'm running shards 1, 5, and 6. 1 is done and it's working on 5
21:13 ^🔗	tpw_rules	i'm experimenting with aufs
21:14 ^🔗	db48x	hmm
21:14 ^🔗	db48x	it should do more to avoid a failure on that initial sync...
21:25 ^🔗	tpw_rules	http://pastie.org/private/pfunxztppb7yw1xognlnqa
21:25 ^🔗	db48x	yep
21:27 ^🔗	db48x	I'm testing a fix by using a shard that has an extra remote; that extra remote points to a machine which is offline
23:47 ^🔗	db48x	tpw_rules: give that a go and see how it works for you
23:47 ^🔗	tpw_rules	'git pull' in the directory and see what happens?
23:48 ^🔗	db48x	in the IA.BAK directory, then restart iabak
23:49 ^🔗	tpw_rules	do the fsck and sync and such cronjobs leave a trail somewhere to let me know that i set them right
23:49 ^🔗	db48x	yes
23:50 ^🔗	tpw_rules	it says i have the cronjob installed but i'm not sure if the hourly sync is happening
23:50 ^🔗	db48x	if your machine uses systemd, it'll put log entries in the journal; journalctl --user-unit=iabak-cronjob will show them to you
23:50 ^🔗	db48x	if you use cron, then the same info goes to the .log file
23:50 ^🔗	tpw_rules	is the hourly sync a cronjob or part of iabak?
23:50 ^🔗	db48x	iabak-cronjob.log
23:51 ^🔗	db48x	part of iabak
23:51 ^🔗	db48x	the cronjob is daily
23:51 ^🔗	tpw_rules	ok. so it should happen if iabak is running
23:51 ^🔗	tpw_rules	i'll see if it happened tomorrow
23:51 ^🔗	tpw_rules	how exactly does fscking work? does it only run if all the current shards are finished?
23:52 ^🔗	db48x	no, it runs at the appointed time every morning regardless
23:52 ^🔗	tpw_rules	syncing should happen*
23:52 ^🔗	db48x	all git annex commands cooperate well with each other, so the fsck will check those files which you've downloaded
23:52 ^🔗	tpw_rules	the wiki said something about the cronjob not doing checksums
23:53 ^🔗	tpw_rules	also your patch works
23:53 ^🔗	db48x	you can make sure the hourly sync is running by checking for the running process called iabak-hourlysync
23:53 ^🔗	db48x	ps -AH will show you
23:53 ^🔗	tpw_rules	looks like it. cool
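
The process check suggested above can also be scripted. This sketch is an assumption-laden convenience, not part of iabak: it runs `ps -A -o args=` (every process's command line, without a header) rather than the `ps -AH` mentioned in the log, and it uses plain substring matching, so any unrelated process whose command line mentions the name would also match. Both function names are hypothetical.

```python
import subprocess


def process_listed(ps_output: str, name: str) -> bool:
    """Return True if any line of the given ps output mentions `name`."""
    return any(name in line for line in ps_output.splitlines())


def hourly_sync_running() -> bool:
    """Check the live process table for the iabak-hourlysync process."""
    result = subprocess.run(
        ["ps", "-A", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return process_listed(result.stdout, "iabak-hourlysync")


if __name__ == "__main__":
    print("hourly sync running:", hourly_sync_running())
```
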
23:54 ^🔗	tpw_rules	i still wish it were faster to start up. the approach of just plugging unix utilities into each other seems to be slowing things down a bit, but it does work
23:54 ^🔗	tpw_rules	(and that's probably a personal opinion more than anything)
23:55 ^🔗	db48x	it'd be just as slow regardless of the implementation language, since you have to enumerate all the files in the repository, checking how many copies there are of each, then shuffle the list
23:55 ^🔗	db48x	shuffling gets faster the closer the shard is to having four copies of everything, but the disk io is still the larger cost
23:55 ^🔗	db48x	however, closure has already designed a replacement
23:56 ^🔗	db48x	instead of iabak asking git annex to find files with too few copies, we'll be able to set a preferred content expression on the repositories that makes it want N of M repositories to have each file
23:58 ^🔗	db48x	it'll assign each file to a repository as a constant-time operation by taking the hash key of the file mod the number of known repositories to pick which repository should download it
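
The "hash key mod the number of known repositories" idea can be illustrated as follows. This is a sketch of the general technique only, not the replacement design itself: the use of SHA-256 over the key string, the function name, and the choice of N consecutive repositories for the N copies are all assumptions, since the log does not describe how the copies are spread.

```python
import hashlib
from typing import List, Sequence


def assign_repositories(key: str, repositories: Sequence[str], n: int = 1) -> List[str]:
    """Pick `n` of the known repositories for a file key, without any crawl."""
    m = len(repositories)
    if not 1 <= n <= m:
        raise ValueError("need 1 <= n <= number of repositories")
    # Hash the key and reduce it modulo the repository count; the cost does
    # not depend on how many files the shard contains.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest, "big") % m
    # Assumption: the n copies go to consecutive repositories, wrapping around.
    return [repositories[(start + i) % m] for i in range(n)]
```

Every client that knows the same ordered repository list computes the same answer for a given key, so no shuffling or global enumeration is needed. The assignment does change whenever the number of known repositories changes.
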
23:58 ^🔗	tpw_rules	i think git annex not having a list of files on disk is a bit silly
23:59 ^🔗	db48x	it does have a list of files on disk
23:59 ^🔗	tpw_rules	to be fair i probably don't know anything
