Time |
Nickname |
Message |
02:07
🔗
|
|
Start_ is now known as Start |
04:35
🔗
|
|
cmaldonad has joined #internetarchive.bak |
04:51
🔗
|
SketchCow |
JesseW says he doesn't have the time to be a shardmaster. |
04:51
🔗
|
SketchCow |
So we need another one, to accompany you two, I think. |
04:51
🔗
|
SketchCow |
Maybe yipdw or godane or another? |
04:51
🔗
|
cmaldonad |
what is the role of the shard master? |
04:52
🔗
|
cmaldonad |
(I don't think I have the time, but I might recruit someone) |
04:53
🔗
|
SketchCow |
The backup of the arcade requires making shards |
04:53
🔗
|
SketchCow |
archive, not arcade |
04:54
🔗
|
SketchCow |
And so people working to make sure we have a bunch stored up as time goes on |
04:54
🔗
|
cmaldonad |
yeah, I am aware of the shards concept |
04:55
🔗
|
cmaldonad |
a shard master is a shard owner, or is this a different role? |
05:05
🔗
|
db48x |
someone has to create the shards |
05:05
🔗
|
cmaldonad |
ok, I get it now |
05:06
🔗
|
db48x |
which involves picking collections from the IA, massaging the metadata, running the scripts that do the automated stuff, making sure that they've worked correctly, improving the scripts, etc |
05:06
🔗
|
db48x |
I've just been updating http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin with the details |
05:07
🔗
|
cmaldonad |
reading that |
05:07
🔗
|
cmaldonad |
WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD |
05:08
🔗
|
db48x |
yahoosucks |
05:08
🔗
|
cmaldonad |
thx |
05:08
🔗
|
db48x |
you're welcome :) |
05:10
🔗
|
db48x |
nooo, my precious pull request!!1 |
05:10
🔗
|
cmaldonad |
wow cfg mgmt with haskell |
05:10
🔗
|
* |
cmaldonad vows |
05:11
🔗
|
db48x |
:) |
05:11
🔗
|
db48x |
it is pretty nifty |
05:22
🔗
|
|
Somebody1 has joined #internetarchive.bak |
05:33
🔗
|
cmaldonad |
I gotta leave, I will configure my IRC at work to be around. I can only write while at home |
05:33
🔗
|
cmaldonad |
see you tomorrow db48x |
05:35
🔗
|
db48x |
indeed, see you later |
05:41
🔗
|
SketchCow |
cmaldonad: Thanks again, feel free to use any subpages on the wiki to work out docs |
05:42
🔗
|
cmaldonad |
will do |
05:42
🔗
|
SketchCow |
Also, I'm probably going to go to datahoarders to bring in some big disk space contributors |
05:42
🔗
|
SketchCow |
Although they're likely, like all "VC", to offer a small portion (500gb) to see if it's worth their time |
05:43
🔗
|
cmaldonad |
is it too stringent to suggest putting SSL on the site? I request SSL and a wildcard cert for tqhosting.com comes up |
05:43
🔗
|
cmaldonad |
I know a local hoarder that might be interested, I will ask him if he has spare space |
05:43
🔗
|
cmaldonad |
I am not a resident of this country, so I don't hold big chunks of data.... or anything |
05:44
🔗
|
cmaldonad |
(living temporarily in Costa Rica) |
05:44
🔗
|
cmaldonad |
well temporary resident, but not a citizen, that's the most accurate description |
05:47
🔗
|
SketchCow |
At some point I'll do ssl |
05:48
🔗
|
cmaldonad |
that's fine, I guess it's temporary |
06:06
🔗
|
|
Somebody1 has quit IRC (Ping timeout: 370 seconds) |
06:21
🔗
|
|
kyan has quit IRC (Quit: Leaving) |
06:24
🔗
|
yipdw |
SketchCow: yeah, I can step in now and again |
06:25
🔗
|
yipdw |
i'm familiar with ia mine and I've seen enough code to get the hint |
06:26
🔗
|
db48x |
yipdw: awesome, send me your ed25519 public key |
06:26
🔗
|
yipdw |
db48x: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEo2mGPw2TTJMHp7G86hMBh6n9/+abzg1oXIIlkwWwzo trythil@aglarond |
06:32
🔗
|
db48x |
ok, you're set |
06:32
🔗
|
db48x |
server is iabak.archiveteam.org |
06:33
🔗
|
db48x |
in case you missed it in the scrollback, see http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin |
06:36
🔗
|
db48x |
I'm updating the nominations page on the wiki |
06:40
🔗
|
yipdw |
cool |
06:40
🔗
|
yipdw |
db48x: can you get me the SHA256 ECDSA host key fingerprint |
06:41
🔗
|
db48x |
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBHb0kXcrF5ThwS8wB0Hez404Zp9bz78ZxEGSqnwuF4d/N3+bymg7/HAj7l/SzRoEXKHsJ7P5320oMxBHeM16Y+k= |
06:41
🔗
|
db48x |
although that's not actually printed as a fingerprint |
06:42
🔗
|
yipdw |
I can pipe that to ssh-keygen, it's fine |
06:42
🔗
|
yipdw |
seems to check out |
06:42
🔗
|
db48x |
256 4e:98:3c:b9:d4:9c:66:27:e5:06:19:de:92:cc:42:b9 /etc/ssh/ssh_host_ecdsa_key.pub (ECDSA) |
06:44
🔗
|
yipdw |
hmm |
06:44
🔗
|
yipdw |
I have this really stupid idea |
06:45
🔗
|
db48x |
:) |
06:45
🔗
|
yipdw |
the collection list is (probably) the smallest input set to use, since it's curated by IA |
06:46
🔗
|
yipdw |
hmm |
06:46
🔗
|
yipdw |
so |
06:46
🔗
|
yipdw |
I need to figure out where I'm going with this |
06:47
🔗
|
yipdw |
ok yea |
06:47
🔗
|
yipdw |
what if we threw all the collections into a database, find /srv/shard to get the active ones, and use that as a basis for collection selection |
06:47
🔗
|
db48x |
we used to do that |
06:47
🔗
|
yipdw |
I think it can be automated via the ia tool and some glue, one sec |
06:48
🔗
|
yipdw |
yeah |
06:48
🔗
|
db48x |
back when IA made a census for us |
06:48
🔗
|
yipdw |
right |
06:48
🔗
|
db48x |
but it seems that they don't any more |
06:49
🔗
|
yipdw |
well, to start, I guess a tool to say "collection is already active" would require no additional datastores and would be helpful |
06:49
🔗
|
yipdw |
hmm although that is tricky isn't it |
06:49
🔗
|
yipdw |
some collections are too huge for one shard |
06:50
🔗
|
db48x |
yea |
06:50
🔗
|
db48x |
the mkSHARD script had a check for that, but I took it out |
06:50
🔗
|
db48x |
because it was super slow |
06:50
🔗
|
yipdw |
right |
06:52
🔗
|
yipdw |
I'll make a few shards, watch what happens |
06:52
🔗
|
yipdw |
then I guess revisit the tool |
06:52
🔗
|
db48x |
another idea is to make a shard which indexes the other shards |
06:52
🔗
|
db48x |
SHARD0 |
06:52
🔗
|
yipdw |
hmm |
06:53
🔗
|
db48x |
put a solr database in there or something so that you can do a search any time |
06:53
🔗
|
db48x |
elastic search or whatever, I never put in the time to figure out how best to implement it |
06:53
🔗
|
yipdw |
it'd be nice if it were a git-annex repo just like the rest |
06:54
🔗
|
db48x |
exactly |
06:54
🔗
|
yipdw |
I dunno how to organize that though |
06:54
🔗
|
yipdw |
s/sh/sha/shardname1? |
06:54
🔗
|
yipdw |
er |
06:54
🔗
|
yipdw |
shard1/i/it/itemname1 or something |
06:55
🔗
|
db48x |
I was thinking just borrow a copy of IA's own index every now and then |
06:55
🔗
|
yipdw |
ah |
06:55
🔗
|
db48x |
augment it with some extra data about which shard we had put each thing into |
06:56
🔗
|
db48x |
sadly this isn't something that IA just happens to have put up as an item on IA |
06:56
🔗
|
db48x |
I'm pretty sure they use elasticsearch though, which means that anyone could download the shard and use the index |
06:57
🔗
|
db48x |
the alternative is to create our own index from the things we put into shards |
06:58
🔗
|
db48x |
still, your idea is a good one even if we don't go that far |
06:58
🔗
|
db48x |
just having some fast way to double check that we haven't put an item into two shards will be great |
06:59
🔗
|
yipdw |
something like |
06:59
🔗
|
yipdw |
yipdw@ia-bak:/srv/shard$ find . -maxdepth 2 -type d -iname 'occupywallstreet' |
06:59
🔗
|
yipdw |
seems pretty fast |
07:00
🔗
|
db48x |
yea, that'll work for now |
07:00
🔗
|
yipdw |
although uh |
07:00
🔗
|
yipdw |
one sec |
07:00
🔗
|
yipdw |
I don't think that works |
07:00
🔗
|
db48x |
it'll be way faster than building up a huge string in mkSHARD by repeated string concatenation, then calling grep |
07:00
🔗
|
yipdw |
wait no, it's fine: /srv/shard/shardN/COLLECTION/ITEM, right |
07:00
🔗
|
db48x |
yes |
07:00
🔗
|
yipdw |
ok |
07:01
🔗
|
db48x |
though items can be in multiple collections, so we want to search for the item identifier, not the collection identifier |
07:01
🔗
|
yipdw |
ah yes |
07:01
🔗
|
yipdw |
yipdw@ia-bak:/srv/shard$ time find . -maxdepth 3 -type d -iname 'rosenresli00spyr' |
07:01
🔗
|
yipdw |
./shard1/internetarchivebooks/rosenresli00spyr |
07:01
🔗
|
yipdw |
real 0m0.425s |
07:01
🔗
|
yipdw |
user 0m0.144s |
07:01
🔗
|
yipdw |
sys 0m0.250s |
07:01
🔗
|
yipdw |
I dunno, it's not horrible |
07:01
🔗
|
db48x |
no, that's great |
07:02
🔗
|
db48x |
.45 seconds is super compared to 45 minutes |
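The `find` lookup shown above can be wrapped in a small helper. This is a minimal sketch, assuming the `/srv/shard/shardN/COLLECTION/ITEM` layout confirmed in the log; the name `find_item` follows the script name floated here, but the body is an illustration and not the script that was actually committed. It uses `-name` rather than `-iname`, on the assumption that item identifiers should match exactly.

```shell
#!/bin/sh
# find_item ROOT ITEM
# Print every ROOT/<shard>/<collection>/ITEM directory and succeed,
# or print nothing and fail when ITEM is in no shard.
# Note: -name takes a glob pattern, so this assumes ITEM contains
# no glob characters such as * ? [ ].
find_item() {
  hits=$(find "$1" -mindepth 3 -maxdepth 3 -type d -name "$2")
  [ -n "$hits" ] && printf '%s\n' "$hits"
}
```

Usage would look like `find_item /srv/shard rosenresli00spyr`. Restricting the search to depth 3 keeps `find` from descending into each item's own files, which is presumably why the timing above is tolerable.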
07:02
🔗
|
yipdw |
heh |
07:02
🔗
|
db48x |
do you have commit access to the IA.BAK repo? |
07:02
🔗
|
yipdw |
I should |
07:02
🔗
|
yipdw |
I do |
07:03
🔗
|
db48x |
yea, you should |
07:03
🔗
|
yipdw |
server branch, commit a find-item script or something |
07:03
🔗
|
db48x |
yea |
07:03
🔗
|
yipdw |
or are you thinking about adding it to mkSHARD |
07:04
🔗
|
db48x |
find-item script is good, as is calling it automatically from mkSHARD :) |
07:04
🔗
|
yipdw |
heh ok |
07:05
🔗
|
db48x |
grr |
07:05
🔗
|
db48x |
github is being annoying |
07:07
🔗
|
db48x |
HCross and Kaz: let yipdw or myself know your github usernames and we'll add you to the repository as well; then you can just push your changes as you make them |
07:08
🔗
|
HCross2 |
HarryC145 |
07:09
🔗
|
Kaz |
I'm just kurtmclester on github |
07:09
🔗
|
db48x |
aha, just as I closed the tab |
07:10
🔗
|
db48x |
done |
07:10
🔗
|
Kaz |
ta |
07:11
🔗
|
db48x |
you're welcome |
07:15
🔗
|
HCross2 |
Thanks |
07:17
🔗
|
db48x |
you're welcome as well :) |
07:17
🔗
|
db48x |
I'll probably be less available tomorrow as I get ready for vacation |
07:18
🔗
|
db48x |
and then I'm on a train for five days with very spotty internet connections |
07:20
🔗
|
db48x |
you guys will probably be done by the time I can check back in |
07:20
🔗
|
db48x |
the whole IA chopped up into chunks |
07:22
🔗
|
db48x |
ah |
07:22
🔗
|
db48x |
I guess the irc gateway is not very reliable |
07:22
🔗
|
db48x |
second time today it's not notified us of a commit |
07:22
🔗
|
yipdw |
it wasn't set to watch the server branch |
07:22
🔗
|
db48x |
ah |
07:22
🔗
|
db48x |
that could explain it as well |
07:23
🔗
|
db48x |
nice, you put comments |
07:23
🔗
|
yipdw |
I guess I'll see about sharding https://archive.org/details/occupywallstreet |
07:23
🔗
|
yipdw |
it seems to be not yet in there |
07:24
🔗
|
db48x |
seems like a good choice |
07:25
🔗
|
yipdw |
"There are security problems inherent in the behaviour that the POSIX standard specifies for find, which therefore cannot be fixed" |
07:25
🔗
|
yipdw |
nice |
07:26
🔗
|
yipdw |
fortunately we have no use for -exec so |
07:26
🔗
|
yipdw |
actually, we could also use locate(1) and updatedb(8) for this |
07:26
🔗
|
yipdw |
it might be faster |
07:27
🔗
|
db48x |
oooh, nice idea |
07:27
🔗
|
yipdw |
let's see how that does |
07:28
🔗
|
yipdw |
oh yeah, hm |
07:28
🔗
|
yipdw |
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND |
07:28
🔗
|
yipdw |
4605 SHARD3 30 10 2183056 1.205g 37852 S 173.6 60.2 21:43.00 git |
07:29
🔗
|
yipdw |
maybe we need to set a git maximum memory limit in whatever runs git pack-objects / git gc |
07:30
🔗
|
yipdw |
yeah, I think so -- the OOM killer has pwned a few git processes in the past |
07:30
🔗
|
db48x |
yea |
07:30
🔗
|
db48x |
though as long as it's only occasionally killed it'll be fine |
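One way to act on the memory-limit idea is through git's own pack settings. The keys below are standard git configuration options, but the values are illustrative guesses rather than tuned numbers, and whether they are enough to keep `git pack-objects` away from the OOM killer on this machine is untested.

```shell
#!/bin/sh
# Run inside one shard's git repository.
# Values are placeholders chosen for illustration only.
git config pack.threads 1            # a single packing thread instead of one per core
git config pack.windowMemory 100m    # cap the delta-window memory used by each thread
git config pack.deltaCacheSize 64m   # cap the cache of computed deltas
```

Both `git gc` and `git repack` go through `git pack-objects`, so they should pick these settings up without changes to whatever script invokes them.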
07:32
🔗
|
yipdw |
so, good news: a shard locatedb at present is 59 MB |
07:32
🔗
|
yipdw |
let's see if I can get useful benchmarks at the moment |
07:32
🔗
|
db48x |
:) |
07:32
🔗
|
db48x |
they'll be useful because they'll be measuring usage during expected load :) |
07:33
🔗
|
yipdw |
huh https://gist.github.com/yipdw/490a9148bfd8db23fc3956b9242c9aed |
07:34
🔗
|
db48x |
is that warm or cold? |
07:34
🔗
|
yipdw |
I ran both commands a few times, but I don't know what the fs cache state is like |
07:34
🔗
|
db48x |
so, warmish |
07:34
🔗
|
yipdw |
the git pack-objects processes are thrashing a lot of things |
07:35
🔗
|
db48x |
potentially warm |
07:35
🔗
|
db48x |
I just realized |
07:36
🔗
|
db48x |
the grep might have been faster than the find |
07:36
🔗
|
db48x |
as slow as it was |
07:36
🔗
|
yipdw |
was it finding multiple items? |
07:36
🔗
|
db48x |
because you might have 30k items in a collection |
07:36
🔗
|
yipdw |
yeah. well, here's option 3 |
07:37
🔗
|
yipdw |
cache the find results; they aren't going to change often (like add it as a post-receive hook or something) |
07:37
🔗
|
yipdw |
grep that |
07:37
🔗
|
db48x |
cache them? |
07:37
🔗
|
yipdw |
yeah, I'm not sure where though |
07:37
🔗
|
yipdw |
sorry, cache the result of, uh |
07:37
🔗
|
yipdw |
find /srv/shard -type d -maxdepth 3 |
07:38
🔗
|
db48x |
oh, cache the list of files |
07:38
🔗
|
yipdw |
yeah |
07:38
🔗
|
db48x |
or rather directories |
07:38
🔗
|
db48x |
and then grep it |
07:38
🔗
|
yipdw |
yeah |
07:38
🔗
|
yipdw |
if you do that, it's great |
07:38
🔗
|
yipdw |
yipdw@ia-bak:~$ time grep 'jstor-3856989' all-items |
07:38
🔗
|
yipdw |
/srv/shard/shard5/jstor_jpoliecon/jstor-3856989 |
07:38
🔗
|
yipdw |
real 0m0.016s |
07:38
🔗
|
yipdw |
user 0m0.002s |
07:38
🔗
|
yipdw |
sys 0m0.009s |
07:38
🔗
|
yipdw |
in fact you could do that in mkSHARD every time it ran, probably |
07:39
🔗
|
yipdw |
building the directory list takes time but it's not horrible |
07:39
🔗
|
yipdw |
redirect it to a tempfile, grep it |
07:39
🔗
|
db48x |
yea, that's perfect |
07:39
🔗
|
db48x |
problem solved |
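The cache-then-search approach can be sketched as two helpers. The function names and the index path are hypothetical. One deliberate difference from the plain `grep` in the log: a bare `grep 'jstor-3856989'` also matches longer identifiers that contain that string, so this sketch compares the last path component exactly.

```shell
#!/bin/sh
# build_index ROOT INDEX
# Cache the list of item directories (ROOT/<shard>/<collection>/<item>),
# writing to a temporary file first so readers never see a partial index.
build_index() {
  tmp_index=$(mktemp "$2.XXXXXX") || return 1
  find "$1" -mindepth 3 -maxdepth 3 -type d > "$tmp_index" && mv "$tmp_index" "$2"
}

# in_index INDEX ITEM
# Succeed if some line of INDEX ends in exactly /ITEM.
in_index() {
  awk -F/ -v id="$2" '$NF == id { found = 1; exit } END { exit !found }' "$1"
}
```

For example, `build_index /srv/shard ~/all-items` followed by `in_index ~/all-items jstor-3856989`. Rebuilding the index from a hook, as suggested above, would keep it fresh without paying the `find` cost on every lookup.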
07:40
🔗
|
yipdw |
i need to figure out where in mkSHARD it did this |
07:40
🔗
|
yipdw |
though other stuff needs to be done first, brb |
07:41
🔗
|
db48x |
https://github.com/ArchiveTeam/IA.BAK/blob/ea6c479d6b7bafb78929888b4b23514bbcab7ab1/mkSHARD |
07:41
🔗
|
yipdw |
you could grep -q that and cut the real time in half, too |
07:41
🔗
|
db48x |
yep |
07:41
🔗
|
yipdw |
hmm, I wonder why it did that |
07:42
🔗
|
yipdw |
am I making a bad assumption in that the filesystem schema is always /shardN/COLLECTION/ITEM |
07:42
🔗
|
db48x |
no |
07:43
🔗
|
db48x |
you used to be able to say mkSHARD "coll1 coll2 coll3" 42 and have it make a SHARD42 out of whatever was in those three collections |
07:44
🔗
|
db48x |
I changed it so that it took a list of files instead |
07:44
🔗
|
db48x |
the tsv file that extract_collection creates |
07:44
🔗
|
db48x |
or split-collection |
07:44
🔗
|
yipdw |
oh, ok, so now we want to check each item in the file to see if it's in a shard |
07:44
🔗
|
db48x |
right |
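Checking every item in a TSV against the existing shards can be done in one pass. This sketch assumes two things not stated in the log: that an index file exists with one item directory path per line, and that the item identifier is the first tab-separated column of the TSV. The function name is hypothetical.

```shell
#!/bin/sh
# already_sharded INDEX TSV
# Print each identifier from TSV (column 1, an assumption) that already
# appears as the last path component of a line in INDEX. Each identifier
# is printed once, even if the TSV lists many files for it.
already_sharded() {
  awk -F'\t' '
    NR == FNR { n = split($0, parts, "/"); seen[parts[n]] = 1; next }
    ($1 in seen) && !printed[$1]++ { print $1 }
  ' "$1" "$2"
}
```

An empty result means none of the items are in a shard yet, so a script such as mkSHARD could refuse to continue whenever the output is non-empty.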
07:45
🔗
|
yipdw |
ok |
07:46
🔗
|
yipdw |
i guess at some point we can get fancier with the indexing but this seems like it'll do at current scale |
07:47
🔗
|
yipdw |
although i'm kinda wondering like how bad would it be to just use sqlite or something for this |
07:47
🔗
|
db48x |
:) |
07:48
🔗
|
db48x |
or rg; it's supposed to be faster than grep :) |
07:50
🔗
|
yipdw |
rg is ironically harder to google for |
07:50
🔗
|
yipdw |
oh ripgrep |
07:52
🔗
|
db48x |
yea |
07:52
🔗
|
db48x |
good technical article about how it's implemented a while back |
08:02
🔗
|
|
jsp12345 has quit IRC (Read error: Connection reset by peer) |
08:03
🔗
|
|
jsp12345 has joined #internetarchive.bak |
08:07
🔗
|
yipdw |
yeah, been reading http://blog.burntsushi.net/ripgrep/ |
08:08
🔗
|
yipdw |
this has some funny synchronicity because in an attempt to further confuse myself, I've been reading about SIMD string-matching instructions |
08:08
🔗
|
yipdw |
for a different project |
08:12
🔗
|
db48x |
nice |
08:35
🔗
|
yipdw |
well, that's cool |
08:35
🔗
|
yipdw |
I'll look a bit more at mkSHARD in the morning; I need to finish some client work and go make Qt do what I want |
08:38
🔗
|
|
Kksmkrn has joined #internetarchive.bak |
09:05
🔗
|
db48x |
yipdw: have fun :) |
10:23
🔗
|
Jon |
hm managed 66G of shard3 since yesterday. it'll be a while before I fill this first 1T |
11:47
🔗
|
|
VADemon has joined #internetarchive.bak |
13:06
🔗
|
|
cmaldonad has quit IRC (Quit: This computer has gone to sleep) |
13:54
🔗
|
|
cmaldonad has joined #internetarchive.bak |
14:01
🔗
|
SketchCow |
That's fine, we'll work things out as we go. |
14:02
🔗
|
SketchCow |
For example, we might add the torrent functionality in the future. |
14:31
🔗
|
|
cmaldonad has quit IRC (Quit: This computer has gone to sleep) |
14:52
🔗
|
|
atomotic has joined #internetarchive.bak |
16:09
🔗
|
Jon |
that'd be cool yeah |
16:09
🔗
|
Jon |
I guess I'm syncing from west-coast US to north east england |
16:09
🔗
|
Jon |
I have a friend in manchester (central-ish/north england) with most of shard3 already |
16:09
🔗
|
Jon |
should take me just under a week to fill this volume then I can open up my second terabyte |
16:26
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
18:06
🔗
|
closure |
db48x: merged propellor changes (and fixed build problems) |
18:10
🔗
|
closure |
please don't make changes directly to /usr/local/propellor on iabak; it prevents updates working |
18:22
🔗
|
closure |
db48x: ran propellor on there, the graphite-manage createsuperuser part is failing |
19:21
🔗
|
|
kyan has joined #internetarchive.bak |
19:42
🔗
|
|
kyan has quit IRC (Remote host closed the connection) |
19:51
🔗
|
HCross |
atm, each 1TB is taking a day to fill |
19:52
🔗
|
HCross |
10 days |
20:00
🔗
|
|
kyan has joined #internetarchive.bak |
20:12
🔗
|
SketchPho |
It is a process to be sure |
21:01
🔗
|
db48x |
closure: sorry about that; I tried running it directly from there to see if it was possible to update the machine that way |
21:02
🔗
|
db48x |
closure: error message? |
21:11
🔗
|
|
atomotic has joined #internetarchive.bak |
21:52
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
22:36
🔗
|
SketchPho |
The heat on this has increased |
22:36
🔗
|
SketchPho |
This channel is logged so I can't give details |
22:37
🔗
|
SketchPho |
Please continue to work in all the Realms you can. I'll write a letter to data hoarders tonight |
23:02
🔗
|
|
sep332 is now known as sep332_ |
23:05
🔗
|
closure |
db48x: you should be able to just run make from inside ~/propellor |
23:07
🔗
|
db48x |
closure: propellor gave me an error about decrypting the private data |
23:07
🔗
|
closure |
ah, right. that is indeed a problem since only I can decrypt that file |
23:07
🔗
|
closure |
probably best to untangle it from my personal config if there will be multiple admins of propellor |
23:07
🔗
|
db48x |
agreed |
23:08
🔗
|
HCross |
when this SCP from my house finishes, ill have a tar of all the .tsv and .json files for archivebot shards. Can someone please give me a hand converting these to shards |
23:08
🔗
|
HCross |
talking a good 15 mins though |
23:09
🔗
|
db48x |
HCross: sure |
23:10
🔗
|
db48x |
HCross: where are you uploading them to? |
23:10
🔗
|
HCross |
uploading it to a local server to me, and then ill wget it from there |
23:11
🔗
|
db48x |
ah |
23:11
🔗
|
db48x |
I thought you were just using scp to send it to iabak directly |
23:12
🔗
|
HCross |
far too slow to do that |
23:12
🔗
|
HCross |
its downloading now |
23:12
🔗
|
HCross |
onto iabak |
23:13
🔗
|
HCross |
check in my folder /archivebot |
23:14
🔗
|
db48x |
I see it |
23:14
🔗
|
yipdw |
heh |
23:14
🔗
|
HCross |
db48x, inside my /archivebot folder, there is a folder called /archivebot - they are all in there |
23:15
🔗
|
yipdw |
those aren't TSVs, they're JSON :P |
23:15
🔗
|
yipdw |
and the JSON isn't JSON heh |
23:15
🔗
|
HCross |
the .tsv is .json and the .json is .tsv - I got it the wrong way round |
23:15
🔗
|
yipdw |
i can see why things might have been tough |
23:16
🔗
|
HCross |
dont cat any of the .tsv files, unless you want "fun" |
23:16
🔗
|
db48x |
actually, the .json files are just the item identifiers |
23:17
🔗
|
db48x |
but it's no problem |
23:17
🔗
|
db48x |
we should probably start by renaming the files to relieve the confusion |
23:17
🔗
|
HCross |
awesome, so we can go from here |
23:18
🔗
|
db48x |
rename 's/meta/ids/' *tsv |
23:19
🔗
|
db48x |
rename 's/tsv/json/' *meta* |
23:19
🔗
|
HCross |
thanks, done |
23:20
🔗
|
HCross |
or not |
23:20
🔗
|
HCross |
oops, prob did it while someone else was |
23:21
🔗
|
db48x |
hmm |
23:21
🔗
|
db48x |
well, I wasn't :) |
23:21
🔗
|
yipdw |
not me |
23:21
🔗
|
yipdw |
I have a local copy |
23:22
🔗
|
db48x |
actually, the second one wouldn't have done anything after the first, because I misthunk |
23:22
🔗
|
db48x |
rename 's/tsv/json/' *files* |
23:23
🔗
|
HCross |
ah there we go |
23:23
🔗
|
db48x |
ok |
23:24
🔗
|
db48x |
that is a little better |
23:24
🔗
|
db48x |
at this point rename 's/meta/ids/' *meta* would help too |
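The `rename 's/…/…/'` form used above is the Perl `rename`; some distributions ship a different `rename` with another syntax. A plain-shell equivalent for fixed strings is sketched below. The function name is hypothetical, and it refuses to overwrite existing files, which helps when two people run the same rename at once.

```shell
#!/bin/sh
# rename_substr OLD NEW FILE...
# Replace the first occurrence of the literal string OLD in each file
# name with NEW. Files without OLD in the name are left alone, and an
# existing target is never overwritten.
rename_substr() {
  old=$1
  new=$2
  shift 2
  for f in "$@"; do
    case $f in
      *"$old"*) target="${f%%"$old"*}$new${f#*"$old"}" ;;
      *) continue ;;
    esac
    if [ -e "$target" ]; then
      echo "skip $f: $target exists" >&2
      continue
    fi
    mv -- "$f" "$target"
  done
}
```

With this, `rename_substr meta ids *meta*` would do the same job as the last command above.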
23:25
🔗
|
HCross |
done |
23:25
🔗
|
db48x |
ok |
23:25
🔗
|
db48x |
so now we have archivebot-files-*.json and we need to convert them into archivebot-files-*.tsv |
23:26
🔗
|
db48x |
for f in archivebot-files-*.json; do rq -r -f get_item_files.rq "${f}" > $(basename "${f}" .json).tsv; done |
23:26
🔗
|
db48x |
that runs rq on all the files one by one |
23:27
🔗
|
db48x |
sending the output to a .tsv file |
23:27
🔗
|
yipdw |
rq or jq |
23:27
🔗
|
db48x |
jq |
23:27
🔗
|
db48x |
:) |
23:27
🔗
|
db48x |
weird that I would type rq twice |
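With the `rq`/`jq` slip corrected, the conversion loop could be written as below. This is a sketch: the function name is hypothetical, the filter file is whatever jq program produces the TSV rows (its contents are not shown in the log), and unlike the one-liner above it writes each `.tsv` next to its input rather than into the current directory.

```shell
#!/bin/sh
# convert_all FILTER FILE...
# Run the jq program stored in FILTER over each .json input and write
# the raw (-r) output to a file with the same name but a .tsv extension.
# Stops at the first file jq fails on.
convert_all() {
  filter=$1
  shift
  for f in "$@"; do
    jq -r -f "$filter" "$f" > "${f%.json}.tsv" || return 1
  done
}
```

Usage would be `convert_all get_item_files.rq archivebot-files-*.json`, keeping the filter file name exactly as it was typed in the log.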
23:28
🔗
|
HCross |
are you in the uk db48x? |
23:29
🔗
|
|
GLaDOS has joined #internetarchive.bak |
23:30
🔗
|
HCross |
sort of thing that happens when tired |
23:31
🔗
|
db48x |
no, california |
23:31
🔗
|
db48x |
yea, now we've got some real tsv files there |
23:31
🔗
|
HCross |
awesome |
23:32
🔗
|
db48x |
now you can run mkSHARD on one and see how it goes |
23:32
🔗
|
db48x |
../IA.BAK/mkSHARD archivebot-files-00.tsv |
23:33
🔗
|
|
GLaDOS has quit IRC (Client Quit) |
23:33
🔗
|
HCross |
best shard ID? |
23:33
🔗
|
|
GLaDOS has joined #internetarchive.bak |
23:34
🔗
|
db48x |
oh, uh |
23:34
🔗
|
db48x |
I think we're up to 14 or 15 now? |
23:34
🔗
|
db48x |
you guys should set up a wiki page to keep track |
23:34
🔗
|
HCross |
Ok, I remember working on 14 as another one, so ill make this 14 instead, and then delete mine |
23:35
🔗
|
db48x |
ok |
23:35
🔗
|
db48x |
doh: -bash: bc: command not found |
23:35
🔗
|
db48x |
ok, I installed bc |
23:36
🔗
|
yipdw |
these total sizes are interesting https://gist.github.com/yipdw/8c490feaa5a48273a99c827f62b793e7 |
23:37
🔗
|
HCross |
We may want to go smaller then |
23:38
🔗
|
db48x |
7TB is pushing it a little |
23:38
🔗
|
HCross |
its a hard one - this is already 37 shards |
23:38
🔗
|
db48x |
yea |
23:38
🔗
|
yipdw |
now we reap the cost of throwing all that shit in the bot |
23:38
🔗
|
yipdw |
heh |
23:38
🔗
|
db48x |
:) |
23:39
🔗
|
db48x |
on the other hand, 7TB is not out of the question |
23:39
🔗
|
HCross |
the first shard is 8.7k files |
23:39
🔗
|
yipdw |
you might be able to split each of the > 4 TB shards down the middle and be fine |
23:40
🔗
|
yipdw |
like, I mean, literally just cut the TSV in half |
23:40
🔗
|
yipdw |
ArchiveBot WARCs are all pretty close to 5 GB each |
23:41
🔗
|
yipdw |
though I think the JSON puts all the PNGs and stuff at the end so maybe some rejiggering is useful |
23:43
🔗
|
HCross |
what is the issue with having such large shards? |
23:44
🔗
|
db48x |
it just makes it harder for a user to grab the whole shard onto one disk |
23:44
🔗
|
yipdw |
I guess that's not such a huge issue with zfs/btrfs/lvm/whatever |
23:44
🔗
|
yipdw |
though it assumes that the majority of your storage servers use those technologies |
23:44
🔗
|
yipdw |
I do know that Jason called specifically for that sort of stuff (or at least "50 TB") |
23:45
🔗
|
|
GLaDOS has quit IRC (Quit: Oh crap, I died.) |
23:45
🔗
|
yipdw |
gonna run a quick experiment |
23:45
🔗
|
HCross |
Why dont we try "see what happens" |
23:45
🔗
|
|
GLaDOS has joined #internetarchive.bak |
23:45
🔗
|
yipdw |
it's easier to reconfigure shards now than it is when they're live |
23:45
🔗
|
HCross |
I know this is critical though |
23:45
🔗
|
yipdw |
at this point, it's just text manipulation |
23:45
🔗
|
HCross |
^ |
23:46
🔗
|
db48x |
could divide the collection into slightly more shards |
23:46
🔗
|
db48x |
divide by 75 instead of by 50 perhaps |
23:47
🔗
|
yipdw |
here's one thing i'm going to try |
23:47
🔗
|
yipdw |
cat archivebot-files-*.tsv | split-by-size-column-into-2TB-or-closest |
23:47
🔗
|
yipdw |
where that second script obviously exists |
23:48
🔗
|
yipdw |
that hits the ideal shard size and allows us to use the TSV data that we have right now |
23:48
🔗
|
yipdw |
the output of that pipeline is a new shard set |
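The `split-by-size-column-into-2TB-or-closest` step named above is a placeholder, and the real script is not shown in the log. A minimal sketch of the idea, assuming the size in bytes sits in a known tab-separated column (the column number is a parameter because the TSV layout is not given here):

```shell
#!/bin/sh
# split_by_size LIMIT SIZECOL PREFIX
# Read TSV rows on stdin and write them to PREFIX.000.tsv, PREFIX.001.tsv, ...
# starting a new output file whenever adding the next row would push the
# running total of column SIZECOL past LIMIT bytes.
split_by_size() {
  awk -F'\t' -v limit="$1" -v col="$2" -v prefix="$3" '
    BEGIN { n = 0; total = 0; out = sprintf("%s.%03d.tsv", prefix, n) }
    {
      size = $col + 0
      if (total > 0 && total + size > limit) {
        close(out)
        n++
        total = 0
        out = sprintf("%s.%03d.tsv", prefix, n)
      }
      print > out
      total += size
    }
  '
}
```

A run such as `cat archivebot-files-*.tsv | split_by_size 2199023255552 3 input.split` would then produce one TSV per shard. A single row larger than the limit still gets a file of its own, so an output can exceed the limit in that case.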
23:49
🔗
|
yipdw |
db48x: can you get Ruby into the install on this machine |
23:49
🔗
|
yipdw |
i'm one of those people |
23:49
🔗
|
db48x |
sure, go ahead and install it |
23:49
🔗
|
yipdw |
oh I thought we were doing this via propellor |
23:49
🔗
|
yipdw |
I can add that to the propellor config |
23:49
🔗
|
HCross |
yipdw, its happening now for you |
23:49
🔗
|
db48x |
yea, but we can't actually run propellor at the moment |
23:50
🔗
|
HCross |
and its done |
23:50
🔗
|
db48x |
so just install it and then modify propellor |
23:50
🔗
|
db48x |
go ahead and add the bc package to propellor while you're there |
23:50
🔗
|
yipdw |
oh ok |
23:50
🔗
|
yipdw |
cool |
23:51
🔗
|
yipdw |
one sec, I'll finish this split experiment first |
23:52
🔗
|
db48x |
I just had a thought |
23:52
🔗
|
db48x |
split-collection blindly splits the list of files, and it doesn't try to keep all the files in an item in the same shard |
23:54
🔗
|
db48x |
which isn't really a problem, but could annoy people |
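If keeping an item's files together ever becomes worth it, a splitter can buffer one item at a time and only cut between items. This is a sketch and not how split-collection works; it assumes the rows for one item are adjacent in the input and that the identifier and size columns are known (both are parameters here, and the function name is hypothetical).

```shell
#!/bin/sh
# split_keep_items LIMIT IDCOL SIZECOL PREFIX
# Like a size-based split, but all rows sharing the value in column IDCOL
# go to the same output file. Requires rows of one item to be adjacent.
split_keep_items() {
  awk -F'\t' -v limit="$1" -v idcol="$2" -v sizecol="$3" -v prefix="$4" '
    function emit_item(   i) {
      # Start a new shard if this whole item would cross the limit.
      if (total > 0 && total + item_size > limit) { close(out); n++; total = 0 }
      out = sprintf("%s.%03d.tsv", prefix, n)
      for (i = 1; i <= count; i++) print rows[i] > out
      total += item_size
      count = 0
      item_size = 0
    }
    NR > 1 && $idcol != last { emit_item() }
    { rows[++count] = $0; item_size += $sizecol; last = $idcol }
    END { if (count > 0) emit_item() }
  '
}
```

The trade-off is less even shard sizes: an item larger than the limit still lands in a single file, so that shard overshoots.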
23:55
🔗
|
yipdw |
/home/yipdw/archivebot/what for 2 TB shards |
23:56
🔗
|
yipdw |
179 :P |
23:56
🔗
|
yipdw |
one second, computing sizes |
23:56
🔗
|
HCross |
hm, do we want that many shards |
23:57
🔗
|
yipdw |
wait I fucked up |
23:57
🔗
|
yipdw |
sorry |
23:57
🔗
|
db48x |
heh |
23:57
🔗
|
yipdw |
10 ** 12 != 2 * (10 ** 12) |
23:57
🔗
|
db48x |
you deleted all the files as I was computing their sizes :) |
23:57
🔗
|
yipdw |
or do you want me to use 2^40 |
23:57
🔗
|
yipdw |
i'm gonna reignite all the holy wars |
23:58
🔗
|
db48x |
base 2, obviously |
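For reference, the two candidate definitions of "2 TB" differ by roughly 10%. The variable names below are only for illustration.

```shell
#!/bin/sh
# Decimal vs. binary shard-size limits, in bytes.
TB=1000000000000           # 10^12, decimal terabyte
TIB=$((1 << 40))           # 2^40 = 1099511627776, binary tebibyte
TWO_TB=$((2 * TB))         # 2000000000000
TWO_TIB=$((2 * TIB))       # 2199023255552
echo "2 TB  = $TWO_TB bytes"
echo "2 TiB = $TWO_TIB bytes"
```

Using the base-2 limit lets each shard hold about 199 GB more than the decimal one would.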
23:58
🔗
|
yipdw |
ok, 80 shards |
23:58
🔗
|
db48x |
input.split.078.tsv: 2.001 GB |
23:58
🔗
|
db48x |
input.split.079.tsv: 2.000 GB |
23:58
🔗
|
db48x |
input.split.080.tsv: 2.008 GB |
23:58
🔗
|
yipdw |
81 |
23:58
🔗
|
yipdw |
wait what |
23:59
🔗
|
yipdw |
2.000? |
23:59
🔗
|
yipdw |
or is that the thousands . |
23:59
🔗
|
db48x |
oh, just a bug in my script |
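A per-file size report like the one pasted above can be produced by summing the size column. This is a sketch of such a report, not db48x's script; the size column number is an assumption passed as a parameter, and the unit is printed as TiB to match the base-2 limit chosen earlier.

```shell
#!/bin/sh
# shard_sizes SIZECOL FILE...
# For each TSV, sum column SIZECOL (bytes) and print "name: N.NNN TiB".
shard_sizes() {
  col=$1
  shift
  for f in "$@"; do
    awk -F'\t' -v col="$col" -v name="$f" '
      { total += $col }
      END { printf "%s: %.3f TiB\n", name, total / (2 ^ 40) }
    ' "$f"
  done
}
```

For example, `shard_sizes 3 input.split.*.tsv` would list every proposed shard with its total size.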
23:59
🔗
|
yipdw |
so yeah, 81 shards if we do it at the 2 TB mark |
23:59
🔗
|
yipdw |
we can probably go for 3 |
23:59
🔗
|
db48x |
or 4 |