#internetarchive.bak 2016-11-11,Fri

Time	Nickname	Message
02:07 ^🔗		Start_ is now known as Start
04:35 ^🔗		cmaldonad has joined #internetarchive.bak
04:51 ^🔗	SketchCow	JesseW says he doesn't have the time to be a shardmaster.
04:51 ^🔗	SketchCow	So we need another one, to accompany you two, I think.
04:51 ^🔗	SketchCow	Maybe yipdw or godane or another?
04:51 ^🔗	cmaldonad	what is the role of the shard master?
04:52 ^🔗	cmaldonad	(I don't think I have the time, but I might recruit someone)
04:53 ^🔗	SketchCow	The backup of the arcade requires making shards
04:53 ^🔗	SketchCow	archive, not arcade
04:54 ^🔗	SketchCow	And so people working to make sure we have a bunch stored up as time goes on
04:54 ^🔗	cmaldonad	yeah, I am aware of the shards concept
04:55 ^🔗	cmaldonad	a shard master is a shard owner, or is this a different role?
05:05 ^🔗	db48x	someone has to create the shards
05:05 ^🔗	cmaldonad	ok, I get it now
05:06 ^🔗	db48x	which involves picking collections from the IA, massaging the metadata, running the scripts that do the automated stuff, making sure that they've worked correctly, improving the scripts, etc
05:06 ^🔗	db48x	I've just been updating http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin with the details
05:07 ^🔗	cmaldonad	reading that
05:07 ^🔗	cmaldonad	WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
05:08 ^🔗	db48x	yahoosucks
05:08 ^🔗	cmaldonad	thx
05:08 ^🔗	db48x	you're welcome :)
05:10 ^🔗	db48x	nooo, my precious pull request!!1
05:10 ^🔗	cmaldonad	wow cfg mgmt with haskell
05:10 ^🔗	*	cmaldonad vows
05:11 ^🔗	db48x	:)
05:11 ^🔗	db48x	it is pretty nifty
05:22 ^🔗		Somebody1 has joined #internetarchive.bak
05:33 ^🔗	cmaldonad	I gotta leave, I will configure my IRC at work to be around. I can only write while at home
05:33 ^🔗	cmaldonad	see you tomorrow db48x
05:35 ^🔗	db48x	indeed, see you later
05:41 ^🔗	SketchCow	cmaldonad: Thanks again, feel free to use any subpages on the wiki to work out docs
05:42 ^🔗	cmaldonad	will do
05:42 ^🔗	SketchCow	Also, I'm probably going to go to datahoarders to bring in some big disk space contributors
05:42 ^🔗	SketchCow	Although they're likely, like all "VC", to offer a small portion (500gb) to see if it's worth their time
05:43 ^🔗	cmaldonad	is it too stringent to suggest putting SSL on the site? I request SSL and a wildcard cert for tqhosting.com comes up
05:43 ^🔗	cmaldonad	I know a local hoarder that might be interested, I will ask him if he has spare space
05:43 ^🔗	cmaldonad	I am not a resident of this country, so I don't hold big chunks of data.... or anything
05:44 ^🔗	cmaldonad	(living temporarily in Costa Rica)
05:44 ^🔗	cmaldonad	well temporary resident, but not a citizen, that's the most accurate description
05:47 ^🔗	SketchCow	At some point I'll do ssl
05:48 ^🔗	cmaldonad	that's fine, I guess it's temporary
06:06 ^🔗		Somebody1 has quit IRC (Ping timeout: 370 seconds)
06:21 ^🔗		kyan has quit IRC (Quit: Leaving)
06:24 ^🔗	yipdw	SketchCow: yeah, I can step in now and again
06:25 ^🔗	yipdw	i'm familiar with ia mine and I've seen enough code to get the hint
06:26 ^🔗	db48x	yipdw: awesome, send me your ed25519 public key
06:26 ^🔗	yipdw	db48x: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEo2mGPw2TTJMHp7G86hMBh6n9/+abzg1oXIIlkwWwzo trythil@aglarond
06:32 ^🔗	db48x	ok, you're set
06:32 ^🔗	db48x	server is iabak.archiveteam.org
06:33 ^🔗	db48x	in case you missed it in the scrollback, see http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin
06:36 ^🔗	db48x	I'm updating the nominations page on the wiki
06:40 ^🔗	yipdw	cool
06:40 ^🔗	yipdw	db48x: can you get me the SHA256 ECDSA host key fingerprint
06:41 ^🔗	db48x	ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBHb0kXcrF5ThwS8wB0Hez404Zp9bz78ZxEGSqnwuF4d/N3+bymg7/HAj7l/SzRoEXKHsJ7P5320oMxBHeM16Y+k=
06:41 ^🔗	db48x	although that's not actually printed as a fingerprint
06:42 ^🔗	yipdw	I can pipe that to ssh-keygen, it's fine
06:42 ^🔗	yipdw	seems to check out
06:42 ^🔗	db48x	256 4e:98:3c:b9:d4:9c:66:27:e5:06:19:de:92:cc:42:b9 /etc/ssh/ssh_host_ecdsa_key.pub (ECDSA)
06:44 ^🔗	yipdw	hmm
06:44 ^🔗	yipdw	I have this really stupid idea
06:45 ^🔗	db48x	:)
06:45 ^🔗	yipdw	the collection list is (probably) the smallest input set to use, since it's curated by IA
06:46 ^🔗	yipdw	hmm
06:46 ^🔗	yipdw	so
06:46 ^🔗	yipdw	I need to figure out where I'm going with this
06:47 ^🔗	yipdw	ok yea
06:47 ^🔗	yipdw	what if we threw all the collections into a database, find /srv/shard to get the active ones, and use that as a basis for collection selection
06:47 ^🔗	db48x	we used to do that
06:47 ^🔗	yipdw	I think it can be automated via the ia tool and some glue, one sec
06:48 ^🔗	yipdw	yeah
06:48 ^🔗	db48x	back when IA made a census for us
06:48 ^🔗	yipdw	right
06:48 ^🔗	db48x	but it seems that they don't any more
06:49 ^🔗	yipdw	well, to start, I guess a tool to say "collection is already active" would require no additional datastores and would be helpful
06:49 ^🔗	yipdw	hmm although that is tricky isn't it
06:49 ^🔗	yipdw	some collections are too huge for one shard
06:50 ^🔗	db48x	yea
06:50 ^🔗	db48x	the mkSHARD script had a check for that, but I took it out
06:50 ^🔗	db48x	because it was super slow
06:50 ^🔗	yipdw	right
06:52 ^🔗	yipdw	I'll make a few shards, watch what happens
06:52 ^🔗	yipdw	then I guess revisit the tool
06:52 ^🔗	db48x	another idea is to make a shard which indexes the other shards
06:52 ^🔗	db48x	SHARD0
06:52 ^🔗	yipdw	hmm
06:53 ^🔗	db48x	put a solr database in there or something so that you can do a search any time
06:53 ^🔗	db48x	elastic search or whatever, I never put in the time to figure out how best to implement it
06:53 ^🔗	yipdw	it'd be nice if it were a git-annex repo just like the rest
06:54 ^🔗	db48x	exactly
06:54 ^🔗	yipdw	I dunno how to organize that though
06:54 ^🔗	yipdw	s/sh/sha/shardname1?
06:54 ^🔗	yipdw	er
06:54 ^🔗	yipdw	shard1/i/it/itemname1 or something
06:55 ^🔗	db48x	I was thinking just borrow a copy of IA's own index every now and then
06:55 ^🔗	yipdw	ah
06:55 ^🔗	db48x	augment it with some extra data about which shard we had put each thing into
06:56 ^🔗	db48x	sadly this isn't something that IA just happens to have put up as an item on IA
06:56 ^🔗	db48x	I'm pretty sure they use elasticsearch though, which means that anyone could download the shard and use the index
06:57 ^🔗	db48x	the alternative is to create our own index from the things we put into shards
06:58 ^🔗	db48x	still, your idea is a good one even if we don't go that far
06:58 ^🔗	db48x	just having some fast way to double check that we haven't put an item into two shards will be great
06:59 ^🔗	yipdw	something like
06:59 ^🔗	yipdw	yipdw@ia-bak:/srv/shard$ find . -maxdepth 2 -type d -iname 'occupywallstreet'
06:59 ^🔗	yipdw	seems pretty fast
07:00 ^🔗	db48x	yea, that'll work for now
07:00 ^🔗	yipdw	although uh
07:00 ^🔗	yipdw	one sec
07:00 ^🔗	yipdw	I don't think that works
07:00 ^🔗	db48x	it'll be way faster than building up a huge string in mkSHARD by repeated string concatenation, then calling grep
07:00 ^🔗	yipdw	wait no, it's fine: /srv/shard/shardN/COLLECTION/ITEM, right
07:00 ^🔗	db48x	yes
07:00 ^🔗	yipdw	ok
07:01 ^🔗	db48x	though items can be in multiple collections, so we want to search for the item identifier, not the collection identifier
07:01 ^🔗	yipdw	ah yes
07:01 ^🔗	yipdw	yipdw@ia-bak:/srv/shard$ time find . -maxdepth 3 -type d -iname 'rosenresli00spyr'
07:01 ^🔗	yipdw	./shard1/internetarchivebooks/rosenresli00spyr
07:01 ^🔗	yipdw	real 0m0.425s
07:01 ^🔗	yipdw	user 0m0.144s
07:01 ^🔗	yipdw	sys 0m0.250s
07:01 ^🔗	yipdw	I dunno, it's not horrible
07:01 ^🔗	db48x	no, that's great
07:02 ^🔗	db48x	.45 seconds is super compared to 45 minutes
07:02 ^🔗	yipdw	heh
07:02 ^🔗	db48x	do you have commit access to the IA.BAK repo?
07:02 ^🔗	yipdw	I should
07:02 ^🔗	yipdw	I do
07:03 ^🔗	db48x	yea, you should
07:03 ^🔗	yipdw	server branch, commit a find-item script or something
07:03 ^🔗	db48x	yea
07:03 ^🔗	yipdw	or are you thinking about adding it to mkSHARD
07:04 ^🔗	db48x	find-item script is good, as is calling it automatically from mkSHARD :)
07:04 ^🔗	yipdw	heh ok
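A minimal sketch of the find-item idea discussed above, assuming the `/srv/shard/shardN/COLLECTION/ITEM` layout confirmed in the channel. The function name `find_item` and its arguments are illustrative and are not taken from the IA.BAK repository. It matches names case-sensitively with `-name`, whereas the commands pasted in the channel used `-iname`.

```shell
# find_item ROOT ITEM: print the directory of every shard copy of ITEM.
# Assumes items live exactly three levels below ROOT (shardN/COLLECTION/ITEM).
# Hypothetical helper; not the script that was committed to the server branch.
find_item() {
    local root="$1" item="$2"
    # Bounding the depth on both sides keeps the walk out of item contents
    # and avoids matching a collection that shares a name with an item.
    find "$root" -mindepth 3 -maxdepth 3 -type d -name "$item" -print
}
```

Called as `find_item /srv/shard rosenresli00spyr`, it prints one path per shard that holds the item, so more than one line of output would indicate a duplicate.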
07:05 ^🔗	db48x	grr
07:05 ^🔗	db48x	github is being annoying
07:07 ^🔗	db48x	HCross and Kaz: let yipdw or myself know your github usernames and we'll add you to the repository as well; then you can just push your changes as you make them
07:08 ^🔗	HCross2	HarryC145
07:09 ^🔗	Kaz	I'm just kurtmclester on github
07:09 ^🔗	db48x	aha, just as I closed the tab
07:10 ^🔗	db48x	done
07:10 ^🔗	Kaz	ta
07:11 ^🔗	db48x	you're welcome
07:15 ^🔗	HCross2	Thanks
07:17 ^🔗	db48x	you're welcome as well :)
07:17 ^🔗	db48x	I'll probably be less available tomorrow as I get ready for vacation
07:18 ^🔗	db48x	and then I'm on a train for five days with very spotty internet connections
07:20 ^🔗	db48x	you guys will probably be done by the time I can check back in
07:20 ^🔗	db48x	the whole IA chopped up into chunks
07:22 ^🔗	db48x	ah
07:22 ^🔗	db48x	I guess the irc gateway is not very reliable
07:22 ^🔗	db48x	second time today it's not notified us of a commit
07:22 ^🔗	yipdw	it wasn't set to watch the server branch
07:22 ^🔗	db48x	ah
07:22 ^🔗	db48x	that could explain it as well
07:23 ^🔗	db48x	nice, you put comments
07:23 ^🔗	yipdw	I guess I'll see about sharding https://archive.org/details/occupywallstreet
07:23 ^🔗	yipdw	it seems to be not yet in there
07:24 ^🔗	db48x	seems like a good choice
07:25 ^🔗	yipdw	"There are security problems inherent in the behaviour that the POSIX standard specifies for find, which therefore cannot be fixed"
07:25 ^🔗	yipdw	nice
07:26 ^🔗	yipdw	fortunately we have no use for -exec so
07:26 ^🔗	yipdw	actually, we could also use locate(1) and updatedb(8) for this
07:26 ^🔗	yipdw	it might be faster
07:27 ^🔗	db48x	oooh, nice idea
07:27 ^🔗	yipdw	let's see how that does
07:28 ^🔗	yipdw	oh yeah, hm
07:28 ^🔗	yipdw	PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
07:28 ^🔗	yipdw	4605 SHARD3 30 10 2183056 1.205g 37852 S 173.6 60.2 21:43.00 git
07:29 ^🔗	yipdw	maybe we need to set a git maximum memory limit in whatever runs git pack-objects / git gc
07:30 ^🔗	yipdw	yeah, I think so -- the OOM killer has pwned a few git processes in the past
07:30 ^🔗	db48x	yea
07:30 ^🔗	db48x	though as long as it's only occasionally killed it'll be fine
07:32 ^🔗	yipdw	so, good news: a shard locatedb at present is 59 MB
07:32 ^🔗	yipdw	let's see if I can get useful benchmarks at the moment
07:32 ^🔗	db48x	:)
07:32 ^🔗	db48x	they'll be useful because they'll be measuring usage during expected load :)
07:33 ^🔗	yipdw	huh https://gist.github.com/yipdw/490a9148bfd8db23fc3956b9242c9aed
07:34 ^🔗	db48x	is that warm or cold?
07:34 ^🔗	yipdw	I ran both commands a few times, but I don't know what the fs cache state is like
07:34 ^🔗	db48x	so, warmish
07:34 ^🔗	yipdw	the git pack-objects processes are thrashing a lot of things
07:35 ^🔗	db48x	potentially warm
07:35 ^🔗	db48x	I just realized
07:36 ^🔗	db48x	the grep might have been faster than the find
07:36 ^🔗	db48x	as slow as it was
07:36 ^🔗	yipdw	was it finding multiple items?
07:36 ^🔗	db48x	because you might have 30k items in a collection
07:36 ^🔗	yipdw	yeah. well, here's option 3
07:37 ^🔗	yipdw	cache the find results; they aren't going to change often (like add it as a post-receive hook or something)
07:37 ^🔗	yipdw	grep that
07:37 ^🔗	db48x	cache them?
07:37 ^🔗	yipdw	yeah, I'm not sure where though
07:37 ^🔗	yipdw	sorry, cache the result of, uh
07:37 ^🔗	yipdw	find /srv/shard -type d -maxdepth 3
07:38 ^🔗	db48x	oh, cache the list of files
07:38 ^🔗	yipdw	yeah
07:38 ^🔗	db48x	or rather directories
07:38 ^🔗	db48x	and then grep it
07:38 ^🔗	yipdw	yeah
07:38 ^🔗	yipdw	if you do that, it's great
07:38 ^🔗	yipdw	yipdw@ia-bak:~$ time grep 'jstor-3856989' all-items
07:38 ^🔗	yipdw	/srv/shard/shard5/jstor_jpoliecon/jstor-3856989
07:38 ^🔗	yipdw	real 0m0.016s
07:38 ^🔗	yipdw	user 0m0.002s
07:38 ^🔗	yipdw	sys 0m0.009s
07:38 ^🔗	yipdw	in fact you could do that in mkSHARD every time it ran, probably
07:39 ^🔗	yipdw	building the directory list takes time but it's not horrible
07:39 ^🔗	yipdw	redirect it to a tempfile, grep it
07:39 ^🔗	db48x	yea, that's perfect
07:39 ^🔗	db48x	problem solved
07:40 ^🔗	yipdw	i need to figure out where in mkSHARD it did this
07:40 ^🔗	yipdw	though other stuff needs to be done first, brb
07:41 ^🔗	db48x	https://github.com/ArchiveTeam/IA.BAK/blob/ea6c479d6b7bafb78929888b4b23514bbcab7ab1/mkSHARD
07:41 ^🔗	yipdw	you could grep -q that and cut the real time in half, too
07:41 ^🔗	db48x	yep
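The cache-then-grep approach can be sketched roughly as follows. The function names are hypothetical, and the index is assumed to be a plain file of item directory paths like the `all-items` file in the timing above. One deliberate swap: a bare `grep 'jstor-3856989'` also matches longer identifiers that contain that string, so this sketch compares the last path component exactly with awk instead, and still stops at the first hit as `grep -q` would.

```shell
# build_item_index ROOT INDEX: snapshot every shardN/COLLECTION/ITEM path.
# Writes to a temp file first so readers never see a half-built index.
build_item_index() {
    local root="$1" index="$2" tmp
    tmp="$(mktemp "${index}.XXXXXX")" || return 1
    find "$root" -mindepth 3 -maxdepth 3 -type d -print | sort > "$tmp" &&
        mv "$tmp" "$index"
}

# item_in_index INDEX ITEM: exit 0 if some indexed path ends in /ITEM.
item_in_index() {
    local index="$1" item="$2"
    # Exact comparison on the last path component; exit on the first match.
    awk -F/ -v id="$item" '$NF == id { found = 1; exit } END { exit !found }' "$index"
}
```

Rebuilding the index is the slow part, so it would presumably run from a hook or at the start of mkSHARD, with each lookup then costing only a scan of one small file.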
07:41 ^🔗	yipdw	hmm, I wonder why it did that
07:42 ^🔗	yipdw	am I making a bad assumption in that the filesystem schema is always /shardN/COLLECTION/ITEM
07:42 ^🔗	db48x	no
07:43 ^🔗	db48x	you used to be able to say mkSHARD "coll1 coll2 coll3" 42 and have it make a SHARD42 out of whatever was in those three collections
07:44 ^🔗	db48x	I changed it so that it took a list of files instead
07:44 ^🔗	db48x	the tsv file that extract_collection creates
07:44 ^🔗	db48x	or split-collection
07:44 ^🔗	yipdw	oh, ok, so now we want to check each item in the file to see if it's in a shard
07:44 ^🔗	db48x	right
07:45 ^🔗	yipdw	ok
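Checking every item in a TSV against the shards could look something like this. It is only a sketch: the column layout of the files that extract_collection and split-collection produce is not shown in this log, so the item identifier is assumed to be in the first tab-separated column, and the index is assumed to be a file with one item directory path per line (for example the output of the `find /srv/shard -type d -maxdepth 3` command quoted earlier). The name `already_sharded` is made up for illustration.

```shell
# already_sharded INDEX TSV: print each identifier in TSV that already has a
# directory listed in INDEX, once per identifier.
# ASSUMPTION: column 1 of the TSV is the item identifier.
already_sharded() {
    local index="$1" tsv="$2"
    awk '
        # First file (the index): remember the last path component of each line.
        NR == FNR { n = split($0, parts, "/"); seen[parts[n]] = 1; next }
        # Second file (the TSV): report identifiers that were seen in the index.
        {
            split($0, cols, "\t")
            id = cols[1]
            if ((id in seen) && !(id in reported)) { reported[id] = 1; print id }
        }
    ' "$index" "$tsv"
}
```

Empty output means none of the listed items is in a shard yet; any output is a list of identifiers to drop from the TSV before running mkSHARD on it.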
07:46 ^🔗	yipdw	i guess at some point we can get fancier with the indexing but this seems like it'll do at current scale
07:47 ^🔗	yipdw	although i'm kinda wondering like how bad would it be to just use sqlite or something for this
07:47 ^🔗	db48x	:)
07:48 ^🔗	db48x	or rg; it's supposed to be faster than grep :)
07:50 ^🔗	yipdw	rg is ironically harder to google for
07:50 ^🔗	yipdw	oh ripgrep
07:52 ^🔗	db48x	yea
07:52 ^🔗	db48x	good technical article about how it's implemented a while back
08:02 ^🔗		jsp12345 has quit IRC (Read error: Connection reset by peer)
08:03 ^🔗		jsp12345 has joined #internetarchive.bak
08:07 ^🔗	yipdw	yeah, been reading http://blog.burntsushi.net/ripgrep/
08:08 ^🔗	yipdw	this has some funny synchronicity because in an attempt to further confuse myself, I've been reading about SIMD string-matching instructions
08:08 ^🔗	yipdw	for a different project
08:12 ^🔗	db48x	nice
08:35 ^🔗	yipdw	well, that's cool
08:35 ^🔗	yipdw	I'll look a bit more at mkSHARD in the morning; I need to finish some client work and go make Qt do what I want
08:38 ^🔗		Kksmkrn has joined #internetarchive.bak
09:05 ^🔗	db48x	yipdw: have fun :)
10:23 ^🔗	Jon	hm managed 66G of shard3 since yesterday. it'll be a while before I fill this first 1T
11:47 ^🔗		VADemon has joined #internetarchive.bak
13:06 ^🔗		cmaldonad has quit IRC (Quit: This computer has gone to sleep)
13:54 ^🔗		cmaldonad has joined #internetarchive.bak
14:01 ^🔗	SketchCow	That's fine, we'll work things out as we go.
14:02 ^🔗	SketchCow	For example, we might add the torrent functionality in the future.
14:31 ^🔗		cmaldonad has quit IRC (Quit: This computer has gone to sleep)
14:52 ^🔗		atomotic has joined #internetarchive.bak
16:09 ^🔗	Jon	that'd be cool yeah
16:09 ^🔗	Jon	I guess I'm syncing from west-coast US to north east england
16:09 ^🔗	Jon	I have a friend in manchester (central-ish/north england) with most of shard3 already
16:09 ^🔗	Jon	should take me just under a week to fill this volume then I can open up my second terabyte
16:26 ^🔗		atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
18:06 ^🔗	closure	db48x: merged propellor changes (and fixed build problems)
18:10 ^🔗	closure	please don't make changes directly to /usr/local/propellor on iabak; it prevents updates working
18:22 ^🔗	closure	db48x: ran propellor on there, the graphite-manage createsuperuser part is failing
19:21 ^🔗		kyan has joined #internetarchive.bak
19:42 ^🔗		kyan has quit IRC (Remote host closed the connection)
19:51 ^🔗	HCross	atm, each 1TB is taking a day to fill
19:52 ^🔗	HCross	10 days
20:00 ^🔗		kyan has joined #internetarchive.bak
20:12 ^🔗	SketchPho	It is a process to be sure
21:01 ^🔗	db48x	closure: sorry about that; I tried running it directly from there to see if it was possible to update the machine that way
21:02 ^🔗	db48x	closure: error message?
21:11 ^🔗		atomotic has joined #internetarchive.bak
21:52 ^🔗		atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
22:36 ^🔗	SketchPho	The heat on this has increased
22:36 ^🔗	SketchPho	This channel is logged so I can't give details
22:37 ^🔗	SketchPho	Please continue to work in all the Realms you can. I'll write a letter to data hoarders tonight
23:02 ^🔗		sep332 is now known as sep332_
23:05 ^🔗	closure	db48x: you should be able to just run make from inside ~/propellor
23:07 ^🔗	db48x	closure: propellor gave me an error about decrypting the private data
23:07 ^🔗	closure	ah, right. that is indeed a problem since only I can decrypt that file
23:07 ^🔗	closure	probably best to untangle it from my personal config if there will be multiple admins of propellor
23:07 ^🔗	db48x	agreed
23:08 ^🔗	HCross	when this SCP from my house finishes, ill have a tar of all the .tsv and .json files for archivebot shards. Can someone please give me a hand converting these to shards
23:08 ^🔗	HCross	talking a good 15 mins though
23:09 ^🔗	db48x	HCross: sure
23:10 ^🔗	db48x	HCross: where are you uploading them to?
23:10 ^🔗	HCross	uploading it to a local server to me, and then ill wget it from there
23:11 ^🔗	db48x	ah
23:11 ^🔗	db48x	I thought you were just using scp to send it to iabak directly
23:12 ^🔗	HCross	far too slow to do that
23:12 ^🔗	HCross	its downloading now
23:12 ^🔗	HCross	onto iabak
23:13 ^🔗	HCross	check in my folder /archivebot
23:14 ^🔗	db48x	I see it
23:14 ^🔗	yipdw	heh
23:14 ^🔗	HCross	db48x, inside my /archivebot folder, there is a folder called /archivebot - they are all in there
23:15 ^🔗	yipdw	those aren't TSVs, they're JSON :P
23:15 ^🔗	yipdw	and the JSON isn't JSON heh
23:15 ^🔗	HCross	the .tsv is .json and the .json is .tsv - I got it the wrong way round
23:15 ^🔗	yipdw	i can see why things might have been tough
23:16 ^🔗	HCross	dont cat any of the .tsv files, unless you want "fun"
23:16 ^🔗	db48x	actually, the .json files are just the item identifiers
23:17 ^🔗	db48x	but it's no problem
23:17 ^🔗	db48x	we should probably start by renaming the files to relieve the confusion
23:17 ^🔗	HCross	awesome, so we can go from here
23:18 ^🔗	db48x	rename 's/meta/ids/' *tsv
23:19 ^🔗	db48x	rename 's/tsv/json/' meta
23:19 ^🔗	HCross	thanks, done
23:20 ^🔗	HCross	or not
23:20 ^🔗	HCross	oops, prob did it while someone else was
23:21 ^🔗	db48x	hmm
23:21 ^🔗	db48x	well, I wasn't :)
23:21 ^🔗	yipdw	not me
23:21 ^🔗	yipdw	I have a local copy
23:22 ^🔗	db48x	actually, the second one wouldn't have done anything after the first, because I misthunk
23:22 ^🔗	db48x	rename 's/tsv/json/' files
23:23 ^🔗	HCross	ah there we go
23:23 ^🔗	db48x	ok
23:24 ^🔗	db48x	that is a little better
23:24 ^🔗	db48x	at this point rename 's/meta/ids/' meta would help too
23:25 ^🔗	HCross	done
23:25 ^🔗	db48x	ok
23:25 ^🔗	db48x	so now we have archivebot-files-.json and we need to convert them into archivebot-files-.tsv
23:26 ^🔗	db48x	for f in archivebot-files-*.json; do rq -r -f get_item_files.rq "${f}" > $(basename "${f}" .json).tsv; done
23:26 ^🔗	db48x	that runs rq on all the files one by one
23:27 ^🔗	db48x	sending the output to a .tsv file
23:27 ^🔗	yipdw	rq or jq
23:27 ^🔗	db48x	jq
23:27 ^🔗	db48x	:)
23:27 ^🔗	db48x	weird that I would type rq twice
23:28 ^🔗	HCross	are you in the uk db48x?
23:29 ^🔗		GLaDOS has joined #internetarchive.bak
23:30 ^🔗	HCross	sort of thing that happens when tired
23:31 ^🔗	db48x	no, california
23:31 ^🔗	db48x	yea, now we've got some real tsv files there
23:31 ^🔗	HCross	awesome
23:32 ^🔗	db48x	now you can run mkSHARD on one and see how it goes
23:32 ^🔗	db48x	../IA.BAK/mkSHARD archivebot-files-00.tsv
23:33 ^🔗		GLaDOS has quit IRC (Client Quit)
23:33 ^🔗	HCross	best shard ID?
23:33 ^🔗		GLaDOS has joined #internetarchive.bak
23:34 ^🔗	db48x	oh, uh
23:34 ^🔗	db48x	I think we're up to 14 or 15 now?
23:34 ^🔗	db48x	you guys should set up a wiki page to keep track
23:34 ^🔗	HCross	Ok, I remember working on 14 as another one, so I'll make this 14 instead, and then delete mine
23:35 ^🔗	db48x	ok
23:35 ^🔗	db48x	doh: -bash: bc: command not found
23:35 ^🔗	db48x	ok, I installed bc
23:36 ^🔗	yipdw	these total sizes are interesting https://gist.github.com/yipdw/8c490feaa5a48273a99c827f62b793e7
23:37 ^🔗	HCross	We may want to go smaller then
23:38 ^🔗	db48x	7TB is pushing it a little
23:38 ^🔗	HCross	its a hard one - this is already 37 shards
23:38 ^🔗	db48x	yea
23:38 ^🔗	yipdw	now we reap the cost of throwing all that shit in the bot
23:38 ^🔗	yipdw	heh
23:38 ^🔗	db48x	:)
23:39 ^🔗	db48x	on the other hand, 7TB is not out of the question
23:39 ^🔗	HCross	the first shard is 8.7k files
23:39 ^🔗	yipdw	you might be able to split each of the > 4 TB shards down the middle and be fine
23:40 ^🔗	yipdw	like, I mean, literally just cut the TSV in half
23:40 ^🔗	yipdw	ArchiveBot WARCs are all pretty close to 5 GB each
23:41 ^🔗	yipdw	though I think the JSON puts all the PNGs and stuff at the end so maybe some rejiggering is useful
23:43 ^🔗	HCross	what is the issue with having such large shards?
23:44 ^🔗	db48x	it just makes it harder for a user to grab the whole shard onto one disk
23:44 ^🔗	yipdw	I guess that's not such a huge issue with zfs/btrfs/lvm/whatever
23:44 ^🔗	yipdw	though it assumes that the majority of your storage servers use those technologies
23:44 ^🔗	yipdw	I do know that Jason called specifically for that sort of stuff (or at least "50 TB")
23:45 ^🔗		GLaDOS has quit IRC (Quit: Oh crap, I died.)
23:45 ^🔗	yipdw	gonna run a quick experiment
23:45 ^🔗	HCross	Why dont we try "see what happens"
23:45 ^🔗		GLaDOS has joined #internetarchive.bak
23:45 ^🔗	yipdw	it's easier to reconfigure shards now than it is when they're live
23:45 ^🔗	HCross	I know this is critical though
23:45 ^🔗	yipdw	at this point, it's just text manipulation
23:45 ^🔗	HCross	^
23:46 ^🔗	db48x	could divide the collection into slightly more shards
23:46 ^🔗	db48x	divide by 75 instead of by 50 perhaps
23:47 ^🔗	yipdw	here's one thing i'm going to try
23:47 ^🔗	yipdw	cat archivebot-files-*.tsv | split-by-size-column-into-2TB-or-closest
23:47 ^🔗	yipdw	where that second script obviously exists
23:48 ^🔗	yipdw	that hits the ideal shard size and allows us to use the TSV data that we have right now
23:48 ^🔗	yipdw	the output of that pipeline is a new shard set
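The second stage of that pipeline is not shown in the log. One possible shape for it, as a sketch only: read TSV rows in order and start a new output list whenever the next row would push the running total past the limit. Which column holds the byte size is an assumption passed in as a parameter, and the function name and output naming are illustrative.

```shell
# split_by_size LIMIT_BYTES SIZE_COLUMN PREFIX < rows.tsv
# Writes PREFIX.001.tsv, PREFIX.002.tsv, ... with rows kept in input order.
# ASSUMPTION: SIZE_COLUMN is the 1-based TSV column holding a size in bytes.
split_by_size() {
    local limit="$1" col="$2" prefix="$3"
    awk -F '\t' -v limit="$limit" -v col="$col" -v prefix="$prefix" '
        BEGIN { part = 1; total = 0; out = sprintf("%s.%03d.tsv", prefix, part) }
        {
            size = $col + 0
            # Roll over when this row would exceed the limit, unless the
            # current list is empty (an oversized row gets a list to itself).
            if (total > 0 && total + size > limit) {
                close(out)
                part++
                total = 0
                out = sprintf("%s.%03d.tsv", prefix, part)
            }
            print > out
            total += size
        }
    '
}
```

For example, `cat archivebot-files-*.tsv | split_by_size 2199023255552 3 input.split` would cut at 2 × 2^40 bytes if the size were in column 3. Rows are split one at a time, so the files of a single item can end up in different lists; sorting and grouping by identifier first would avoid that at the cost of less even shard sizes.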
23:49 ^🔗	yipdw	db48x: can you get Ruby into the install on this machine
23:49 ^🔗	yipdw	i'm one of those people
23:49 ^🔗	db48x	sure, go ahead and install it
23:49 ^🔗	yipdw	oh I thought we were doing this via propellor
23:49 ^🔗	yipdw	I can add that to the propellor config
23:49 ^🔗	HCross	yipdw, its happening now for you
23:49 ^🔗	db48x	yea, but we can't actually run propellor at the moment
23:50 ^🔗	HCross	and its done
23:50 ^🔗	db48x	so just install it and then modify propellor
23:50 ^🔗	db48x	go ahead and add the bc package to propellor while you're there
23:50 ^🔗	yipdw	oh ok
23:50 ^🔗	yipdw	cool
23:51 ^🔗	yipdw	one sec, I'll finish this split experiment first
23:52 ^🔗	db48x	I just had a thought
23:52 ^🔗	db48x	split-collection blindly splits the list of files, and it doesn't try to keep all the files in an item in the same shard
23:54 ^🔗	db48x	which isn't really a problem, but could annoy people
23:55 ^🔗	yipdw	/home/yipdw/archivebot/what for 2 TB shards
23:56 ^🔗	yipdw	179 :P
23:56 ^🔗	yipdw	one second, computing sizes
23:56 ^🔗	HCross	hm, do we want that many shards
23:57 ^🔗	yipdw	wait I fucked up
23:57 ^🔗	yipdw	sorry
23:57 ^🔗	db48x	heh
23:57 ^🔗	yipdw	10 ** 12 != 2 * (10 ** 12)
23:57 ^🔗	db48x	you deleted all the files as I was computing their sizes :)
23:57 ^🔗	yipdw	or do you want me to use 2^40
23:57 ^🔗	yipdw	i'm gonna reignite all the holy wars
23:58 ^🔗	db48x	base 2, obviously
23:58 ^🔗	yipdw	ok, 80 shards
23:58 ^🔗	db48x	input.split.078.tsv: 2.001 GB
23:58 ^🔗	db48x	input.split.079.tsv: 2.000 GB
23:58 ^🔗	db48x	input.split.080.tsv: 2.008 GB
23:58 ^🔗	yipdw	81
23:58 ^🔗	yipdw	wait what
23:59 ^🔗	yipdw	2.000?
23:59 ^🔗	yipdw	or is that the thousands .
23:59 ^🔗	db48x	oh, just a bug in my script
23:59 ^🔗	yipdw	so yeah, 81 shards if we do it at the 2 TB mark
23:59 ^🔗	yipdw	we can probably go for 3
23:59 ^🔗	db48x	or 4
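The unit question above changes the shard limit by about 10%, which is easy to check with shell arithmetic. The sketch below spells out the decimal and binary constants and adds a ceiling division that gives a lower bound on the number of shards for a given total. The variable and function names are illustrative, and no real archive totals are assumed.

```shell
TB_DECIMAL=1000000000000   # 10^12 bytes
TIB_BINARY=1099511627776   # 2^40 bytes

# min_shards TOTAL_BYTES LIMIT_BYTES: ceiling of TOTAL / LIMIT.
# A split that never cuts a row can need more shards than this bound.
min_shards() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

echo "2 TB  (decimal): $((2 * TB_DECIMAL)) bytes"
echo "2 TiB (binary):  $((2 * TIB_BINARY)) bytes"
```

Raising the limit to 3 or 4 units, as suggested at the end of the log, only means passing a larger `LIMIT_BYTES`.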
