#internetarchive.bak 2016-11-11,Fri


Time Nickname Message
02:07 🔗 Start_ is now known as Start
04:35 🔗 cmaldonad has joined #internetarchive.bak
04:51 🔗 SketchCow JesseW says he doesn't have the time to be a shardmaster.
04:51 🔗 SketchCow So we need another one, to accompany you two, I think.
04:51 🔗 SketchCow Maybe yipdw or godane or another?
04:51 🔗 cmaldonad what is the role of the shard master?
04:52 🔗 cmaldonad (I don't think I have the time, but I might recruit someone)
04:53 🔗 SketchCow The backup of the arcade requires making shards
04:53 🔗 SketchCow archive, not arcade
04:54 🔗 SketchCow And so people working to make sure we have a bunch stored up as time goes on
04:54 🔗 cmaldonad yeah, I am aware of the shards concept
04:55 🔗 cmaldonad a shard master is a shard owner, or is this a different role?
05:05 🔗 db48x someone has to create the shards
05:05 🔗 cmaldonad ok, I get it now
05:06 🔗 db48x which involves picking collections from the IA, massaging the metadata, running the scripts that do the automated stuff, making sure that they've worked correctly, improving the scripts, etc
05:06 🔗 db48x I've just been updating http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin with the details
05:07 🔗 cmaldonad reading that
05:07 🔗 cmaldonad WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
05:08 🔗 db48x yahoosucks
05:08 🔗 cmaldonad thx
05:08 🔗 db48x you're welcome :)
05:10 🔗 db48x nooo, my precious pull request!!1
05:10 🔗 cmaldonad wow cfg mgmt with haskell
05:10 🔗 * cmaldonad vows
05:11 🔗 db48x :)
05:11 🔗 db48x it is pretty nifty
05:22 🔗 Somebody1 has joined #internetarchive.bak
05:33 🔗 cmaldonad I gotta leave, I will configure my IRC at work to be around. I can only write while at home
05:33 🔗 cmaldonad see you tomorrow db48x
05:35 🔗 db48x indeed, see you later
05:41 🔗 SketchCow cmaldonad: Thanks again, feel free to use any subpages on the wiki to work out docs
05:42 🔗 cmaldonad will do
05:42 🔗 SketchCow Also, I'm probably going to go to datahoarders to bring in some big disk space contributors
05:42 🔗 SketchCow Although they're likely, like all "VC", to offer a small portion (500gb) to see if it's worth their time
05:43 🔗 cmaldonad is it too stringent to suggest putting SSL on the site? I request SSL and a wildcard cert for tqhosting.com comes up
05:43 🔗 cmaldonad I know a local hoarder that might be interested, I will ask him if he has spare space
05:43 🔗 cmaldonad I am not a resident of this country, so I don't hold big chunks of data.... or anything
05:44 🔗 cmaldonad (living temporarily in Costa Rica)
05:44 🔗 cmaldonad well temporary resident, but not a citizen, that's the most accurate description
05:47 🔗 SketchCow At some point I'll do ssl
05:48 🔗 cmaldonad that's fine, I guess it's temporary
06:06 🔗 Somebody1 has quit IRC (Ping timeout: 370 seconds)
06:21 🔗 kyan has quit IRC (Quit: Leaving)
06:24 🔗 yipdw SketchCow: yeah, I can step in now and again
06:25 🔗 yipdw i'm familiar with ia mine and I've seen enough code to get the hint
06:26 🔗 db48x yipdw: awesome, send me your ed25519 public key
06:26 🔗 yipdw db48x: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEo2mGPw2TTJMHp7G86hMBh6n9/+abzg1oXIIlkwWwzo trythil@aglarond
06:32 🔗 db48x ok, you're set
06:32 🔗 db48x server is iabak.archiveteam.org
06:33 🔗 db48x in case you missed it in the scrollback, see http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin
06:36 🔗 db48x I'm updating the nominations page on the wiki
06:40 🔗 yipdw cool
06:40 🔗 yipdw db48x: can you get me the SHA256 ECDSA host key fingerprint
06:41 🔗 db48x ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBHb0kXcrF5ThwS8wB0Hez404Zp9bz78ZxEGSqnwuF4d/N3+bymg7/HAj7l/SzRoEXKHsJ7P5320oMxBHeM16Y+k=
06:41 🔗 db48x although that's not actually printed as a fingerprint
06:42 🔗 yipdw I can pipe that to ssh-keygen, it's fine
06:42 🔗 yipdw seems to check out
06:42 🔗 db48x 256 4e:98:3c:b9:d4:9c:66:27:e5:06:19:de:92:cc:42:b9 /etc/ssh/ssh_host_ecdsa_key.pub (ECDSA)
06:44 🔗 yipdw hmm
06:44 🔗 yipdw I have this really stupid idea
06:45 🔗 db48x :)
06:45 🔗 yipdw the collection list is (probably) the smallest input set to use, since it's curated by IA
06:46 🔗 yipdw hmm
06:46 🔗 yipdw so
06:46 🔗 yipdw I need to figure out where I'm going with this
06:47 🔗 yipdw ok yea
06:47 🔗 yipdw what if we threw all the collections into a database, find /srv/shard to get the active ones, and use that as a basis for collection selection
06:47 🔗 db48x we used to do that
06:47 🔗 yipdw I think it can be automated via the ia tool and some glue, one sec
06:48 🔗 yipdw yeah
06:48 🔗 db48x back when IA made a census for us
06:48 🔗 yipdw right
06:48 🔗 db48x but it seems that they don't any more
06:49 🔗 yipdw well, to start, I guess a tool to say "collection is already active" would require no additional datastores and would be helpful
06:49 🔗 yipdw hmm although that is tricky isn't it
06:49 🔗 yipdw some collections are too huge for one shard
06:50 🔗 db48x yea
06:50 🔗 db48x the mkSHARD script had a check for that, but I took it out
06:50 🔗 db48x because it was super slow
06:50 🔗 yipdw right
06:52 🔗 yipdw I'll make a few shards, watch what happens
06:52 🔗 yipdw then I guess revisit the tool
06:52 🔗 db48x another idea is to make a shard which indexes the other shards
06:52 🔗 db48x SHARD0
06:52 🔗 yipdw hmm
06:53 🔗 db48x put a solr database in there or something so that you can do a search any time
06:53 🔗 db48x elastic search or whatever, I never put in the time to figure out how best to implement it
06:53 🔗 yipdw it'd be nice if it were a git-annex repo just like the rest
06:54 🔗 db48x exactly
06:54 🔗 yipdw I dunno how to organize that though
06:54 🔗 yipdw s/sh/sha/shardname1?
06:54 🔗 yipdw er
06:54 🔗 yipdw shard1/i/it/itemname1 or something
06:55 🔗 db48x I was thinking just borrow a copy of IA's own index every now and then
06:55 🔗 yipdw ah
06:55 🔗 db48x augment it with some extra data about which shard we had put each thing into
06:56 🔗 db48x sadly this isn't something that IA just happens to have put up as an item on IA
06:56 🔗 db48x I'm pretty sure they use elasticsearch though, which means that anyone could download the shard and use the index
06:57 🔗 db48x the alternative is to create our own index from the things we put into shards
06:58 🔗 db48x still, your idea is a good one even if we don't go that far
06:58 🔗 db48x just having some fast way to double check that we haven't put an item into two shards will be great
06:59 🔗 yipdw something like
06:59 🔗 yipdw yipdw@ia-bak:/srv/shard$ find . -maxdepth 2 -type d -iname 'occupywallstreet'
06:59 🔗 yipdw seems pretty fast
07:00 🔗 db48x yea, that'll work for now
07:00 🔗 yipdw although uh
07:00 🔗 yipdw one sec
07:00 🔗 yipdw I don't think that works
07:00 🔗 db48x it'll be way faster than building up a huge string in mkSHARD by repeated string concatenation, then calling grep
07:00 🔗 yipdw wait no, it's fine: /srv/shard/shardN/COLLECTION/ITEM, right
07:00 🔗 db48x yes
07:00 🔗 yipdw ok
07:01 🔗 db48x though items can be in multiple collections, so we want to search for the item identifier, not the collection identifier
07:01 🔗 yipdw ah yes
07:01 🔗 yipdw yipdw@ia-bak:/srv/shard$ time find . -maxdepth 3 -type d -iname 'rosenresli00spyr'
07:01 🔗 yipdw ./shard1/internetarchivebooks/rosenresli00spyr
07:01 🔗 yipdw real 0m0.425s
07:01 🔗 yipdw user 0m0.144s
07:01 🔗 yipdw sys 0m0.250s
07:01 🔗 yipdw I dunno, it's not horrible
07:01 🔗 db48x no, that's great
07:02 🔗 db48x .45 seconds is super compared to 45 minutes
07:02 🔗 yipdw heh
07:02 🔗 db48x do you have commit access to the IA.BAK repo?
07:02 🔗 yipdw I should
07:02 🔗 yipdw I do
07:03 🔗 db48x yea, you should
07:03 🔗 yipdw server branch, commit a find-item script or something
07:03 🔗 db48x yea
07:03 🔗 yipdw or are you thinking about adding it to mkSHARD
07:04 🔗 db48x find-item script is good, as is calling it automatically from mkSHARD :)
07:04 🔗 yipdw heh ok
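The find-item lookup discussed above can be sketched in Python. This is only an illustration, not the script that was committed to the IA.BAK repo: the function name `find_item` is hypothetical, and it assumes the layout confirmed in the log (shard root, then shardN, then COLLECTION, then ITEM). Unlike `find -maxdepth 3 -iname`, it only matches names at the item level, not shard or collection names.

```python
import os


def find_item(shard_root, item_id):
    """Return every SHARD/COLLECTION/ITEM directory whose ITEM name matches item_id.

    The comparison is case-insensitive, like find's -iname with no wildcards.
    Hypothetical helper; assumes the shard_root/shardN/COLLECTION/ITEM layout.
    """
    wanted = item_id.lower()
    matches = []
    for shard in sorted(os.listdir(shard_root)):
        shard_path = os.path.join(shard_root, shard)
        if not os.path.isdir(shard_path):
            continue
        for collection in sorted(os.listdir(shard_path)):
            coll_path = os.path.join(shard_path, collection)
            if not os.path.isdir(coll_path):
                continue
            for item in sorted(os.listdir(coll_path)):
                item_path = os.path.join(coll_path, item)
                if item.lower() == wanted and os.path.isdir(item_path):
                    matches.append(item_path)
    return matches
```

An empty result means the item is not in any shard yet; more than one result means it has been placed in two shards (or two collections), which is the case db48x wants to catch.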
07:05 🔗 db48x grr
07:05 🔗 db48x github is being annoying
07:07 🔗 db48x HCross and Kaz: let yipdw or myself know your github usernames and we'll add you to the repository as well; then you can just push your changes as you make them
07:08 🔗 HCross2 HarryC145
07:09 🔗 Kaz I'm just kurtmclester on github
07:09 🔗 db48x aha, just as I closed the tab
07:10 🔗 db48x done
07:10 🔗 Kaz ta
07:11 🔗 db48x you're welcome
07:15 🔗 HCross2 Thanks
07:17 🔗 db48x you're welcome as well :)
07:17 🔗 db48x I'll probably be less available tomorrow as I get ready for vacation
07:18 🔗 db48x and then I'm on a train for five days with very spotty internet connections
07:20 🔗 db48x you guys will probably be done by the time I can check back in
07:20 🔗 db48x the whole IA chopped up into chunks
07:22 🔗 db48x ah
07:22 🔗 db48x I guess the irc gateway is not very reliable
07:22 🔗 db48x second time today it's not notified us of a commit
07:22 🔗 yipdw it wasn't set to watch the server branch
07:22 🔗 db48x ah
07:22 🔗 db48x that could explain it as well
07:23 🔗 db48x nice, you put comments
07:23 🔗 yipdw I guess I'll see about sharding https://archive.org/details/occupywallstreet
07:23 🔗 yipdw it seems to be not yet in there
07:24 🔗 db48x seems like a good choice
07:25 🔗 yipdw "There are security problems inherent in the behaviour that the POSIX standard specifies for find, which therefore cannot be fixed"
07:25 🔗 yipdw nice
07:26 🔗 yipdw fortunately we have no use for -exec so
07:26 🔗 yipdw actually, we could also use locate(1) and updatedb(8) for this
07:26 🔗 yipdw it might be faster
07:27 🔗 db48x oooh, nice idea
07:27 🔗 yipdw let's see how that does
07:28 🔗 yipdw oh yeah, hm
07:28 🔗 yipdw PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
07:28 🔗 yipdw 4605 SHARD3 30 10 2183056 1.205g 37852 S 173.6 60.2 21:43.00 git
07:29 🔗 yipdw maybe we need to set a git maximum memory limit in whatever runs git pack-objects / git gc
07:30 🔗 yipdw yeah, I think so -- the OOM killer has pwned a few git processes in the past
07:30 🔗 db48x yea
07:30 🔗 db48x though as long as it's only occasionally killed it'll be fine
07:32 🔗 yipdw so, good news: a shard locatedb at present is 59 MB
07:32 🔗 yipdw let's see if I can get useful benchmarks at the moment
07:32 🔗 db48x :)
07:32 🔗 db48x they'll be useful because they'll be measuring usage during expected load :)
07:33 🔗 yipdw huh https://gist.github.com/yipdw/490a9148bfd8db23fc3956b9242c9aed
07:34 🔗 db48x is that warm or cold?
07:34 🔗 yipdw I ran both commands a few times, but I don't know what the fs cache state is like
07:34 🔗 db48x so, warmish
07:34 🔗 yipdw the git pack-objects processes are thrashing a lot of things
07:35 🔗 db48x potentially warm
07:35 🔗 db48x I just realized
07:36 🔗 db48x the grep might have been faster than the find
07:36 🔗 db48x as slow as it was
07:36 🔗 yipdw was it finding multiple items?
07:36 🔗 db48x because you might have 30k items in a collection
07:36 🔗 yipdw yeah. well, here's option 3
07:37 🔗 yipdw cache the find results; they aren't going to change often (like add it as a post-receive hook or something)
07:37 🔗 yipdw grep that
07:37 🔗 db48x cache them?
07:37 🔗 yipdw yeah, I'm not sure where though
07:37 🔗 yipdw sorry, cache the result of, uh
07:37 🔗 yipdw find /srv/shard -type d -maxdepth 3
07:38 🔗 db48x oh, cache the list of files
07:38 🔗 yipdw yeah
07:38 🔗 db48x or rather directories
07:38 🔗 db48x and then grep it
07:38 🔗 yipdw yeah
07:38 🔗 yipdw if you do that, it's great
07:38 🔗 yipdw yipdw@ia-bak:~$ time grep 'jstor-3856989' all-items
07:38 🔗 yipdw /srv/shard/shard5/jstor_jpoliecon/jstor-3856989
07:38 🔗 yipdw real 0m0.016s
07:38 🔗 yipdw user 0m0.002s
07:38 🔗 yipdw sys 0m0.009s
07:38 🔗 yipdw in fact you could do that in mkSHARD every time it ran, probably
07:39 🔗 yipdw building the directory list takes time but it's not horrible
07:39 🔗 yipdw redirect it to a tempfile, grep it
07:39 🔗 db48x yea, that's perfect
07:39 🔗 db48x problem solved
07:40 🔗 yipdw i need to figure out where in mkSHARD it did this
07:40 🔗 yipdw though other stuff needs to be done first, brb
07:41 🔗 db48x https://github.com/ArchiveTeam/IA.BAK/blob/ea6c479d6b7bafb78929888b4b23514bbcab7ab1/mkSHARD
07:41 🔗 yipdw you could grep -q that and cut the real time in half, too
07:41 🔗 db48x yep
07:41 🔗 yipdw hmm, I wonder why it did that
07:42 🔗 yipdw am I making a bad assumption in that the filesystem schema is always /shardN/COLLECTION/ITEM
07:42 🔗 db48x no
07:43 🔗 db48x you used to be able to say mkSHARD "coll1 coll2 coll3" 42 and have it make a SHARD42 out of whatever was in those three collections
07:44 🔗 db48x I changed it so that it took a list of files instead
07:44 🔗 db48x the tsv file that extract_collection creates
07:44 🔗 db48x or split-collection
07:44 🔗 yipdw oh, ok, so now we want to check each item in the file to see if it's in a shard
07:44 🔗 db48x right
07:45 🔗 yipdw ok
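The "cache the directory list, then check each item in the file" idea can be sketched as follows. This is a rough illustration under assumptions that are not in the log: the helper names are hypothetical, the cached list is taken to be the output of `find /srv/shard -type d -maxdepth 3` (one path per line), and the item identifier is assumed to sit in the first column of the TSV.

```python
import posixpath


def load_item_index(path_lines, shard_root="/srv/shard"):
    """Build {item_id: [paths]} from cached find output.

    Only paths exactly three levels below shard_root (shardN/COLLECTION/ITEM)
    are kept; the root, shard and collection directories are skipped.
    """
    index = {}
    for line in path_lines:
        path = line.strip()
        if not path:
            continue
        parts = posixpath.relpath(path, shard_root).split("/")
        if len(parts) != 3 or parts[0] == "..":
            continue
        index.setdefault(parts[2], []).append(path)
    return index


def already_sharded(tsv_lines, index, id_column=0):
    """Return the identifiers in the TSV that already appear in the index.

    id_column is an assumption about the TSV layout, not something the
    log specifies. Each identifier is reported once, in input order.
    """
    found = []
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        item_id = line.split("\t")[id_column]
        if item_id in index and item_id not in found:
            found.append(item_id)
    return found
```

Loading the cached list into a dictionary once and then doing one lookup per TSV row avoids running grep once per item, which may matter when a collection holds tens of thousands of items.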
07:46 🔗 yipdw i guess at some point we can get fancier with the indexing but this seems like it'll do at current scale
07:47 🔗 yipdw although i'm kinda wondering like how bad would it be to just use sqlite or something for this
07:47 🔗 db48x :)
07:48 🔗 db48x or rg; it's supposed to be faster than grep :)
07:50 🔗 yipdw rg is ironically harder to google for
07:50 🔗 yipdw oh ripgrep
07:52 🔗 db48x yea
07:52 🔗 db48x good technical article about how it's implemented a while back
08:02 🔗 jsp12345 has quit IRC (Read error: Connection reset by peer)
08:03 🔗 jsp12345 has joined #internetarchive.bak
08:07 🔗 yipdw yeah, been reading http://blog.burntsushi.net/ripgrep/
08:08 🔗 yipdw this has some funny synchronicity because in an attempt to further confuse myself, I've been reading about SIMD string-matching instructions
08:08 🔗 yipdw for a different project
08:12 🔗 db48x nice
08:35 🔗 yipdw well, that's cool
08:35 🔗 yipdw I'll look a bit more at mkSHARD in the morning; I need to finish some client work and go make Qt do what I want
08:38 🔗 Kksmkrn has joined #internetarchive.bak
09:05 🔗 db48x yipdw: have fun :)
10:23 🔗 Jon hm managed 66G of shard3 since yesterday. it'll be a while before I fill this first 1T
11:47 🔗 VADemon has joined #internetarchive.bak
13:06 🔗 cmaldonad has quit IRC (Quit: This computer has gone to sleep)
13:54 🔗 cmaldonad has joined #internetarchive.bak
14:01 🔗 SketchCow That's fine, we'll work things out as we go.
14:02 🔗 SketchCow For example, we might add the torrent functionality in the future.
14:31 🔗 cmaldonad has quit IRC (Quit: This computer has gone to sleep)
14:52 🔗 atomotic has joined #internetarchive.bak
16:09 🔗 Jon that'd be cool yeah
16:09 🔗 Jon I guess I'm syncing from west-coast US to north east England
16:09 🔗 Jon I have a friend in manchester (central-ish/north england) with most of shard3 already
16:09 🔗 Jon should take me just under a week to fill this volume then I can open up my second terabyte
16:26 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
18:06 🔗 closure db48x: merged propellor changes (and fixed build problems)
18:10 🔗 closure please don't make changes directly to /usr/local/propellor on iabak; it prevents updates working
18:22 🔗 closure db48x: ran propellor on there, the graphite-manage createsuperuser part is failing
19:21 🔗 kyan has joined #internetarchive.bak
19:42 🔗 kyan has quit IRC (Remote host closed the connection)
19:51 🔗 HCross atm, each 1TB is taking a day to fill
19:52 🔗 HCross 10 days
20:00 🔗 kyan has joined #internetarchive.bak
20:12 🔗 SketchPho It is a process to be sure
21:01 🔗 db48x closure: sorry about that; I tried running it directly from there to see if it was possible to update the machine that way
21:02 🔗 db48x closure: error message?
21:11 🔗 atomotic has joined #internetarchive.bak
21:52 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
22:36 🔗 SketchPho The heat on this has increased
22:36 🔗 SketchPho This channel is logged so I can't give details
22:37 🔗 SketchPho Please continue to work in all the Realms you can. I'll write a letter to data hoarders tonight
23:02 🔗 sep332 is now known as sep332_
23:05 🔗 closure db48x: you should be able to just run make from inside ~/propellor
23:07 🔗 db48x closure: propellor gave me an error about decrypting the private data
23:07 🔗 closure ah, right. that is indeed a problem since only I can decrypt that file
23:07 🔗 closure probably best to untangle it from my personal config if there will be multiple admins of propellor
23:07 🔗 db48x agreed
23:08 🔗 HCross when this SCP from my house finishes, ill have a tar of all the .tsv and .json files for archivebot shards. Can someone please give me a hand converting these to shards
23:08 🔗 HCross talking a good 15 mins though
23:09 🔗 db48x HCross: sure
23:10 🔗 db48x HCross: where are you uploading them to?
23:10 🔗 HCross uploading it to a local server to me, and then ill wget it from there
23:11 🔗 db48x ah
23:11 🔗 db48x I thought you were just using scp to send it to iabak directly
23:12 🔗 HCross far too slow to do that
23:12 🔗 HCross its downloading now
23:12 🔗 HCross onto iabak
23:13 🔗 HCross check in my folder /archivebot
23:14 🔗 db48x I see it
23:14 🔗 yipdw heh
23:14 🔗 HCross db48x, inside my /archivebot folder, there is a folder called /archivebot - they are all in there
23:15 🔗 yipdw those aren't TSVs, they're JSON :P
23:15 🔗 yipdw and the JSON isn't JSON heh
23:15 🔗 HCross the .tsv is .json and the .json is .tsv - I got it the wrong way round
23:15 🔗 yipdw i can see why things might have been tough
23:16 🔗 HCross dont cat any of the .tsv files, unless you want "fun"
23:16 🔗 db48x actually, the .json files are just the item identifiers
23:17 🔗 db48x but it's no problem
23:17 🔗 db48x we should probably start by renaming the files to relieve the confusion
23:17 🔗 HCross awesome, so we can go from here
23:18 🔗 db48x rename 's/meta/ids/' *tsv
23:19 🔗 db48x rename 's/tsv/json/' *meta*
23:19 🔗 HCross thanks, done
23:20 🔗 HCross or not
23:20 🔗 HCross oops, prob did it while someone else was
23:21 🔗 db48x hmm
23:21 🔗 db48x well, I wasn't :)
23:21 🔗 yipdw not me
23:21 🔗 yipdw I have a local copy
23:22 🔗 db48x actually, the second one wouldn't have done anything after the first, because I misthunk
23:22 🔗 db48x rename 's/tsv/json/' *files*
23:23 🔗 HCross ah there we go
23:23 🔗 db48x ok
23:24 🔗 db48x that is a little better
23:24 🔗 db48x at this point rename 's/meta/ids/' *meta* would help too
23:25 🔗 HCross done
23:25 🔗 db48x ok
23:25 🔗 db48x so now we have archivebot-files-*.json and we need to convert them into archivebot-files-*.tsv
23:26 🔗 db48x for f in archivebot-files-*.json; do rq -r -f get_item_files.rq "${f}" > $(basename "${f}" .json).tsv; done
23:26 🔗 db48x that runs rq on all the files one by one
23:27 🔗 db48x sending the output to a .tsv file
23:27 🔗 yipdw rq or jq
23:27 🔗 db48x jq
23:27 🔗 db48x :)
23:27 🔗 db48x weird that I would type rq twice
23:28 🔗 HCross are you in the uk db48x?
23:29 🔗 GLaDOS has joined #internetarchive.bak
23:30 🔗 HCross sort of thing that happens when tired
23:31 🔗 db48x no, california
23:31 🔗 db48x yea, now we've got some real tsv files there
23:31 🔗 HCross awesome
23:32 🔗 db48x now you can run mkSHARD on one and see how it goes
23:32 🔗 db48x ../IA.BAK/mkSHARD archivebot-files-00.tsv
23:33 🔗 GLaDOS has quit IRC (Client Quit)
23:33 🔗 HCross best shard ID?
23:33 🔗 GLaDOS has joined #internetarchive.bak
23:34 🔗 db48x oh, uh
23:34 🔗 db48x I think we're up to 14 or 15 now?
23:34 🔗 db48x you guys should set up a wiki page to keep track
23:34 🔗 HCross Ok, I remember working on 14 as an other one, so ill make this 14 instead, and then delete mine
23:35 🔗 db48x ok
23:35 🔗 db48x doh: -bash: bc: command not found
23:35 🔗 db48x ok, I installed bc
23:36 🔗 yipdw these total sizes are interesting https://gist.github.com/yipdw/8c490feaa5a48273a99c827f62b793e7
23:37 🔗 HCross We may want to go smaller then
23:38 🔗 db48x 7TB is pushing it a little
23:38 🔗 HCross its a hard one - this is already 37 shards
23:38 🔗 db48x yea
23:38 🔗 yipdw now we reap the cost of throwing all that shit in the bot
23:38 🔗 yipdw heh
23:38 🔗 db48x :)
23:39 🔗 db48x on the other hand, 7TB is not out of the question
23:39 🔗 HCross the first shard is 8.7k files
23:39 🔗 yipdw you might be able to split each of the > 4 TB shards down the middle and be fine
23:40 🔗 yipdw like, I mean, literally just cut the TSV in half
23:40 🔗 yipdw ArchiveBot WARCs are all pretty close to 5 GB each
23:41 🔗 yipdw though I think the JSON puts all the PNGs and stuff at the end so maybe some rejiggering is useful
23:43 🔗 HCross what is the issue with having such large shards?
23:44 🔗 db48x it just makes it harder for a user to grab the whole shard onto one disk
23:44 🔗 yipdw I guess that's not such a huge issue with zfs/btrfs/lvm/whatever
23:44 🔗 yipdw though it assumes that the majority of your storage servers use those technologies
23:44 🔗 yipdw I do know that Jason called specifically for that sort of stuff (or at least "50 TB")
23:45 🔗 GLaDOS has quit IRC (Quit: Oh crap, I died.)
23:45 🔗 yipdw gonna run a quick experiment
23:45 🔗 HCross Why dont we try "see what happens"
23:45 🔗 GLaDOS has joined #internetarchive.bak
23:45 🔗 yipdw it's easier to reconfigure shards now than it is when they're live
23:45 🔗 HCross I know this is critical though
23:45 🔗 yipdw at this point, it's just text manipulation
23:45 🔗 HCross ^
23:46 🔗 db48x could divide the collection into slightly more shards
23:46 🔗 db48x divide by 75 instead of by 50 perhaps
23:47 🔗 yipdw here's one thing i'm going to try
23:47 🔗 yipdw cat archivebot-files-*.tsv | split-by-size-column-into-2TB-or-closest
23:47 🔗 yipdw where that second script obviously exists
23:48 🔗 yipdw that hits the ideal shard size and allows us to use the TSV data that we have right now
23:48 🔗 yipdw the output of that pipeline is a new shard set
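The split-by-size step yipdw describes can be sketched as a greedy, in-order packer. This is not the script that was actually run (which the log never shows); the function name and the position of the size column are assumptions made for illustration.

```python
TB = 10 ** 12   # decimal terabyte
TIB = 2 ** 40   # binary tebibyte


def split_by_size(rows, limit_bytes, size_column=1):
    """Pack rows, in order, into shards whose total size stays within limit_bytes.

    rows is a sequence of already-split TSV rows; size_column is an assumed
    position for the file size in bytes. A single row larger than the limit
    still gets a shard of its own rather than being dropped.
    """
    shards = []
    current, current_size = [], 0
    for row in rows:
        size = int(row[size_column])
        # Start a new shard when this row would push the current one over.
        if current and current_size + size > limit_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(row)
        current_size += size
    if current:
        shards.append(current)
    return shards
```

Calling it with `limit_bytes=2 * TIB` rather than `2 * TB` raises the per-shard limit by roughly 10%, so the choice of unit changes the shard count noticeably at this scale.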
23:49 🔗 yipdw db48x: can you get Ruby into the install on this machine
23:49 🔗 yipdw i'm one of those people
23:49 🔗 db48x sure, go ahead and install it
23:49 🔗 yipdw oh I thought we were doing this via propellor
23:49 🔗 yipdw I can add that to the propellor config
23:49 🔗 HCross yipdw, its happening now for you
23:49 🔗 db48x yea, but we can't actually run propellor at the moment
23:50 🔗 HCross and its done
23:50 🔗 db48x so just install it and then modify propellor
23:50 🔗 db48x go ahead and add the bc package to propellor while you're there
23:50 🔗 yipdw oh ok
23:50 🔗 yipdw cool
23:51 🔗 yipdw one sec, I'll finish this split experiment first
23:52 🔗 db48x I just had a thought
23:52 🔗 db48x split-collection blindly splits the list of files, and it doesn't try to keep all the files in an item in the same shard
23:54 🔗 db48x which isn't really a problem, but could annoy people
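One way to avoid scattering an item's files across shards is to split on item boundaries instead of file boundaries. The sketch below is hypothetical and does not reflect how split-collection works: it assumes the identifier and size columns shown in the signature, and that all rows for one item are adjacent in the input.

```python
from itertools import groupby


def split_keeping_items_together(rows, limit_bytes, id_column=0, size_column=1):
    """Pack rows into shards without splitting any item's files across shards.

    Assumes rows belonging to the same item are adjacent (groupby only merges
    consecutive rows). An item larger than limit_bytes gets its own shard.
    """
    shards = []
    current, current_size = [], 0
    for _, group in groupby(rows, key=lambda row: row[id_column]):
        files = list(group)
        group_size = sum(int(row[size_column]) for row in files)
        # Close the current shard if the whole item would not fit in it.
        if current and current_size + group_size > limit_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.extend(files)
        current_size += group_size
    if current:
        shards.append(current)
    return shards
```

The trade-off is that shard sizes become less even, since a shard is closed early whenever the next item does not fit as a whole.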
23:55 🔗 yipdw /home/yipdw/archivebot/what for 2 TB shards
23:56 🔗 yipdw 179 :P
23:56 🔗 yipdw one second, computing sizes
23:56 🔗 HCross hm, do we want that many shards
23:57 🔗 yipdw wait I fucked up
23:57 🔗 yipdw sorry
23:57 🔗 db48x heh
23:57 🔗 yipdw 10 ** 12 != 2 * (10 ** 12)
23:57 🔗 db48x you deleted all the files as I was computing their sizes :)
23:57 🔗 yipdw or do you want me to use 2^40
23:57 🔗 yipdw i'm gonna reignite all the holy wars
23:58 🔗 db48x base 2, obviously
23:58 🔗 yipdw ok, 80 shards
23:58 🔗 db48x input.split.078.tsv: 2.001 GB
23:58 🔗 db48x input.split.079.tsv: 2.000 GB
23:58 🔗 db48x input.split.080.tsv: 2.008 GB
23:58 🔗 yipdw 81
23:58 🔗 yipdw wait what
23:59 🔗 yipdw 2.000?
23:59 🔗 yipdw or is that the thousands .
23:59 🔗 db48x oh, just a bug in my script
23:59 🔗 yipdw so yeah, 81 shards if we do it at the 2 TB mark
23:59 🔗 yipdw we can probably go for 3
23:59 🔗 db48x or 4
