[02:07] *** Start_ is now known as Start
[04:35] *** cmaldonad has joined #internetarchive.bak
[04:51] JesseW says he doesn't have the time to be a shardmaster.
[04:51] So we need another one, to accompany you two, I think.
[04:51] Maybe yipdw or godane or another?
[04:51] what is the role of the shard master?
[04:52] (I don't think I have the time, but I might recruit someone)
[04:53] The backup of the arcade requires making shards
[04:53] archive, not arcade
[04:54] And so people working to make sure we have a bunch stored up as time goes on
[04:54] yeah, I am aware of the shards concept
[04:55] a shard master is a shard owner, or is this a different role?
[05:05] someone has to create the shards
[05:05] ok, I get it now
[05:06] which involves picking collections from the IA, massaging the metadata, running the scripts that do the automated stuff, making sure that they've worked correctly, improving the scripts, etc
[05:06] I've just been updating http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin with the details
[05:07] reading that
[05:07] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[05:08] yahoosucks
[05:08] thx
[05:08] you're welcome :)
[05:10] nooo, my precious pull request!!1
[05:10] wow cfg mgmt with haskell
[05:10] * cmaldonad vows
[05:11] :)
[05:11] it is pretty nifty
[05:22] *** Somebody1 has joined #internetarchive.bak
[05:33] I gotta leave, I will configure my IRC at work to be around. I can only write while at home
[05:33] see you tomorrow db48x
[05:35] indeed, see you later
[05:41] cmaldonad: Thanks again, feel free to use any subpages on the wiki to work out docs
[05:42] will do
[05:42] Also, I'm probably going to go to datahoarders to bring in some big disk space contributors
[05:42] Although they're likely, like all "VC", to offer a small portion (500gb) to see if it's worth their time
[05:43] is it too stringent to suggest putting SSL on the site? When I request SSL, a wildcard cert for tqhosting.com comes up
[05:43] I know a local hoarder that might be interested, I will ask him if he has spare space
[05:43] I am not a resident of this country, so I don't hold big chunks of data.... or anything
[05:44] (living temporarily in Costa Rica)
[05:44] well temporary resident, but not a citizen, that's the most accurate description
[05:47] At some point I'll do ssl
[05:48] that's fine, I guess it's temporary
[06:06] *** Somebody1 has quit IRC (Ping timeout: 370 seconds)
[06:21] *** kyan has quit IRC (Quit: Leaving)
[06:24] SketchCow: yeah, I can step in now and again
[06:25] i'm familiar with ia mine and I've seen enough code to get the hint
[06:26] yipdw: awesome, send me your ed25519 public key
[06:26] db48x: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEo2mGPw2TTJMHp7G86hMBh6n9/+abzg1oXIIlkwWwzo trythil@aglarond
[06:32] ok, you're set
[06:32] server is iabak.archiveteam.org
[06:33] in case you missed it in the scrollback, see http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/admin
[06:36] I'm updating the nominations page on the wiki
[06:40] cool
[06:40] db48x: can you get me the SHA256 ECDSA host key fingerprint
[06:41] ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBHb0kXcrF5ThwS8wB0Hez404Zp9bz78ZxEGSqnwuF4d/N3+bymg7/HAj7l/SzRoEXKHsJ7P5320oMxBHeM16Y+k=
[06:41] although that's not actually printed as a fingerprint
[06:42] I can pipe that to ssh-keygen, it's fine
[06:42] seems to check out
[06:42] 256 4e:98:3c:b9:d4:9c:66:27:e5:06:19:de:92:cc:42:b9 /etc/ssh/ssh_host_ecdsa_key.pub (ECDSA)
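(Editor's sketch of the fingerprint check above: save the advertised host key line to a file and fingerprint it with ssh-keygen. The key is abbreviated here; older OpenSSH prints the MD5 colon-hex form seen above, while newer versions default to SHA256.)

    # Save the advertised host key line and print its fingerprint.
    echo 'ecdsa-sha2-nistp256 AAAA...' > /tmp/iabak-hostkey.pub
    ssh-keygen -l -f /tmp/iabak-hostkey.pub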
[06:44] hmm
[06:44] I have this really stupid idea
[06:45] :)
[06:45] the collection list is (probably) the smallest input set to use, since it's curated by IA
[06:46] hmm
[06:46] so
[06:46] I need to figure out where I'm going with this
[06:47] ok yea
[06:47] what if we threw all the collections into a database, find /srv/shard to get the active ones, and use that as a basis for collection selection
[06:47] we used to do that
[06:47] I think this can be automated via the ia tool and some glue, one sec
[06:48] yeah
[06:48] back when IA made a census for us
[06:48] right
[06:48] but it seems that they don't any more
[06:49] well, to start, I guess a tool to say "collection is already active" would require no additional datastores and would be helpful
[06:49] hmm although that is tricky isn't it
[06:49] some collections are too huge for one shard
[06:50] yea
[06:50] the mkSHARD script had a check for that, but I took it out
[06:50] because it was super slow
[06:50] right
[06:52] I'll make a few shards, watch what happens
[06:52] then I guess revisit the tool
[06:52] another idea is to make a shard which indexes the other shards
[06:52] SHARD0
[06:52] hmm
[06:53] put a solr database in there or something so that you can do a search any time
[06:53] elastic search or whatever, I never put in the time to figure out how best to implement it
[06:53] it'd be nice if it were a git-annex repo just like the rest
[06:54] exactly
[06:54] I dunno how to organize that though
[06:54] s/sh/sha/shardname1?
[06:54] er
[06:54] shard1/i/it/itemname1 or something
[06:55] I was thinking just borrow a copy of IA's own index every now and then
[06:55] ah
[06:55] augment it with some extra data about which shard we had put each thing into
[06:56] sadly this isn't something that IA just happens to have put up as an item on IA
[06:56] I'm pretty sure they use elasticsearch though, which means that anyone could download the shard and use the index
[06:57] the alternative is to create our own index from the things we put into shards
[06:58] still, your idea is a good one even if we don't go that far
[06:58] just having some fast way to double check that we haven't put an item into two shards will be great
[06:59] something like
[06:59] yipdw@ia-bak:/srv/shard$ find . -maxdepth 2 -type d -iname 'occupywallstreet'
[06:59] seems pretty fast
[07:00] yea, that'll work for now
[07:00] although uh
[07:00] one sec
[07:00] I don't think that works
[07:00] it'll be way faster than building up a huge string in mkSHARD by repeated string concatenation, then calling grep
[07:00] wait no, it's fine: /srv/shard/shardN/COLLECTION/ITEM, right
[07:00] yes
[07:00] ok
[07:01] though items can be in multiple collections, so we want to search for the item identifier, not the collection identifier
[07:01] ah yes
[07:01] yipdw@ia-bak:/srv/shard$ time find . -maxdepth 3 -type d -iname 'rosenresli00spyr'
[07:01] ./shard1/internetarchivebooks/rosenresli00spyr
[07:01] real 0m0.425s
[07:01] user 0m0.144s
[07:01] sys 0m0.250s
[07:01] I dunno, it's not horrible
[07:01] no, that's great
[07:02] .45 seconds is super compared to 45 minutes
[07:02] heh
[07:02] do you have commit access to the IA.BAK repo?
[07:02] I should
[07:02] I do
[07:03] yea, you should
[07:03] server branch, commit a find-item script or something
[07:03] yea
[07:03] or are you thinking about adding it to mkSHARD
[07:04] find-item script is good, as is calling it automatically from mkSHARD :)
[07:04] heh ok
[07:05] grr
[07:05] github is being annoying
[07:07] HCross and Kaz: let yipdw or myself know your github usernames and we'll add you to the repository as well; then you can just push your changes as you make them
[07:08] HarryC145
[07:09] I'm just kurtmclester on github
[07:09] aha, just as I closed the tab
[07:10] done
[07:10] ta
[07:11] you're welcome
[07:15] Thanks
[07:17] you're welcome as well :)
[07:17] I'll probably be less available tomorrow as I get ready for vacation
[07:18] and then I'm on a train for five days with very spotty internet connections
[07:20] you guys will probably be done by the time I can check back in
[07:20] the whole IA chopped up into chunks
[07:22] ah
[07:22] I guess the irc gateway is not very reliable
[07:22] second time today it's not notified us of a commit
[07:22] it wasn't set to watch the server branch
[07:22] ah
[07:22] that could explain it as well
[07:23] nice, you put comments
[07:23] I guess I'll see about sharding https://archive.org/details/occupywallstreet
[07:23] it seems to be not yet in there
[07:24] seems like a good choice
[07:25] "There are security problems inherent in the behaviour that the POSIX standard specifies for find, which therefore cannot be fixed"
[07:25] nice
[07:26] fortunately we have no use for -exec so
[07:26] actually, we could also use locate(1) and updatedb(8) for this
[07:26] it might be faster
[07:27] oooh, nice idea
[07:27] let's see how that does
[07:28] oh yeah, hm
[07:28] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
[07:28] 4605 SHARD3 30 10 2183056 1.205g 37852 S 173.6 60.2 21:43.00 git
[07:29] maybe we need to set a git maximum memory limit in whatever runs git pack-objects / git gc
[07:30] yeah, I think so -- the OOM killer has pwned a few git processes in the past
[07:30] yea
[07:30] though as long as it's only occasionally killed it'll be fine
[07:32] so, good news: a shard locatedb at present is 59 MB
[07:32] let's see if I can get useful benchmarks at the moment
[07:32] :)
[07:32] they'll be useful because they'll be measuring usage during expected load :)
[07:33] huh https://gist.github.com/yipdw/490a9148bfd8db23fc3956b9242c9aed
[07:34] is that warm or cold?
[07:34] I ran both commands a few times, but I don't know what the fs cache state is like
[07:34] so, warmish
[07:34] the git pack-objects processes are thrashing a lot of things
[07:35] potentially warm
[07:35] I just realized
[07:36] the grep might have been faster than the find
[07:36] as slow as it was
[07:36] was it finding multiple items?
[07:36] because you might have 30k items in a collection
[07:36] yeah. well, here's option 3
[07:37] cache the find results; they aren't going to change often (like add it as a post-receive hook or something)
[07:37] grep that
[07:37] cache them?
[07:37] yeah, I'm not sure where though
[07:37] sorry, cache the result of, uh
[07:37] find /srv/shard -type d -maxdepth 3
[07:38] oh, cache the list of files
[07:38] yeah
[07:38] or rather directories
[07:38] and then grep it
[07:38] yeah
[07:38] if you do that, it's great
[07:38] yipdw@ia-bak:~$ time grep 'jstor-3856989' all-items
[07:38] /srv/shard/shard5/jstor_jpoliecon/jstor-3856989
[07:38] real 0m0.016s
[07:38] user 0m0.002s
[07:38] sys 0m0.009s
[07:38] in fact you could do that in mkSHARD every time it ran, probably
[07:39] building the directory list takes time but it's not horrible
[07:39] redirect it to a tempfile, grep it
[07:39] yea, that's perfect
[07:39] problem solved
[07:40] i need to figure out where in mkSHARD it did this
[07:40] though other stuff needs to be done first, brb
[07:41] https://github.com/ArchiveTeam/IA.BAK/blob/ea6c479d6b7bafb78929888b4b23514bbcab7ab1/mkSHARD
[07:41] you could grep -q that and cut the real time in half, too
[07:41] yep
[07:41] hmm, I wonder why it did that
[07:42] am I making a bad assumption in that the filesystem schema is always /shardN/COLLECTION/ITEM
[07:42] no
[07:43] you used to be able to say mkSHARD "coll1 coll2 coll3" 42 and have it make a SHARD42 out of whatever was in those three collections
[07:44] I changed it so that it took a list of files instead
[07:44] the tsv file that extract_collection creates
[07:44] or split-collection
[07:44] oh, ok, so now we want to check each item in the file to see if it's in a shard
[07:44] right
[07:45] ok
[07:46] i guess at some point we can get fancier with the indexing but this seems like it'll do at current scale
[07:47] although i'm kinda wondering like how bad would it be to just use sqlite or something for this
[07:47] :)
[07:48] or rg; it's supposed to be faster than grep :)
[07:50] rg is ironically harder to google for
[07:50] oh ripgrep
[07:52] yea
[07:52] good technical article about how it's implemented a while back
[08:02] *** jsp12345 has quit IRC (Read error: Connection reset by peer)
[08:03] *** jsp12345 has joined #internetarchive.bak
[08:07] yeah, been reading http://blog.burntsushi.net/ripgrep/
[08:08] this has some funny synchronicity because in an attempt to further confuse myself, I've been reading about SIMD string-matching instructions
[08:08] for a different project
[08:12] nice
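(Editor's note on the "ia tool and some glue" idea from 06:47: the internetarchive CLI can already emit a per-collection identifier list, which is the raw input the collection-database idea needs; the collection name below is just an example.)

    # List every item identifier in a collection, one per line.
    ia search 'collection:occupywallstreet' --itemlist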
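(Editor's note on the git memory limit mentioned at 07:29: pack.windowMemory and friends are the usual knobs; the values below are illustrative guesses for a small box, not settings taken from iabak.)

    # Per-repository caps so background repacks don't trip the OOM killer.
    git config pack.windowMemory 256m
    git config pack.deltaCacheSize 128m
    git config pack.threads 1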
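(Editor's sketch of the cached-list lookup settled on above; the all-items path and rebuilding it from a hook or from mkSHARD are illustrative assumptions, not necessarily what the repo ended up doing.)

    # Rebuild the cached list of /srv/shard/shardN/COLLECTION/ITEM dirs;
    # writing to a temp file first keeps readers from seeing a partial list.
    find /srv/shard -mindepth 3 -maxdepth 3 -type d > /srv/shard/all-items.tmp \
        && mv /srv/shard/all-items.tmp /srv/shard/all-items

    # Item identifiers are the last path component, so anchor the match;
    # grep -q halves the wall-clock time since it stops at the first hit.
    if grep -q '/rosenresli00spyr$' /srv/shard/all-items; then
        echo "item is already in a shard"
    fi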
[08:35] well, that's cool
[08:35] I'll look a bit more at mkSHARD in the morning; I need to finish some client work and go make Qt do what I want
[08:38] *** Kksmkrn has joined #internetarchive.bak
[09:05] yipdw: have fun :)
[10:23] hm managed 66G of shard3 since yesterday. it'll be a while before I fill this first 1T
[11:47] *** VADemon has joined #internetarchive.bak
[13:06] *** cmaldonad has quit IRC (Quit: This computer has gone to sleep)
[13:54] *** cmaldonad has joined #internetarchive.bak
[14:01] That's fine, we'll work things out as we go.
[14:02] For example, we might add the torrent functionality in the future.
[14:31] *** cmaldonad has quit IRC (Quit: This computer has gone to sleep)
[14:52] *** atomotic has joined #internetarchive.bak
[16:09] that'd be cool yeah
[16:09] I guess I'm syncing from west-coast US to north east england
[16:09] I have a friend in manchester (central-ish/north england) with most of shard3 already
[16:09] should take me just under a week to fill this volume then I can open up my second terabyte
[16:26] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[18:06] db48x: merged propellor changes (and fixed build problems)
[18:10] please don't make changes directly to /usr/local/propellor on iabak; it prevents updates from working
[18:22] db48x: ran propellor on there, the graphite-manage createsuperuser part is failing
[19:21] *** kyan has joined #internetarchive.bak
[19:42] *** kyan has quit IRC (Remote host closed the connection)
[19:51] atm, each 1TB is taking a day to fill
[19:52] 10 days
[20:00] *** kyan has joined #internetarchive.bak
[20:12] It is a process to be sure
[21:01] closure: sorry about that; I tried running it directly from there to see if it was possible to update the machine that way
[21:02] closure: error message?
[21:11] *** atomotic has joined #internetarchive.bak
[21:52] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[22:36] The heat on this has increased
[22:36] This channel is logged so I can't give details
[22:37] Please continue to work in all the Realms you can. I'll write a letter to data hoarders tonight
[23:02] *** sep332 is now known as sep332_
[23:05] db48x: you should be able to just run make from inside ~/propellor
[23:07] closure: propellor gave me an error about decrypting the private data
[23:07] ah, right. that is indeed a problem since only I can decrypt that file
[23:07] probably best to untangle it from my personal config if there will be multiple admins of propellor
[23:07] agreed
[23:08] when this SCP from my house finishes, ill have a tar of all the .tsv and .json files for archivebot shards. Can someone please give me a hand converting these to shards
[23:08] talking a good 15 mins though
[23:09] HCross: sure
[23:10] HCross: where are you uploading them to?
[23:10] uploading it to a local server to me, and then ill wget it from there
[23:11] ah
[23:11] I thought you were just using scp to send it to iabak directly
[23:12] far too slow to do that
[23:12] its downloading now
[23:12] onto iabak
[23:13] check in my folder /archivebot
[23:14] I see it
[23:14] heh
[23:14] db48x, inside my /archivebot folder, there is a folder called /archivebot - they are all in there
[23:15] those aren't TSVs, they're JSON :P
[23:15] and the JSON isn't JSON heh
[23:15] the .tsv is .json and the .json is .tsv - I got it the wrong way round
[23:15] i can see why things might have been tough
[23:16] dont cat any of the .tsv files, unless you want "fun"
[23:16] actually, the .json files are just the item identifiers
[23:17] but it's no problem
[23:17] we should probably start by renaming the files to relieve the confusion
[23:18] rename 's/meta/ids/' *tsv
[23:19] rename 's/tsv/json/' *meta*
[23:19] thanks, done
[23:20] or not
[23:20] oops, prob did it while someone else was
[23:21] hmm
[23:21] well, I wasn't :)
[23:21] not me
[23:21] I have a local copy
[23:22] actually, the second one wouldn't have done anything after the first, because I misthunk
[23:22] rename 's/tsv/json/' *files*
[23:23] ah there we go
[23:23] ok
[23:24] that is a little better
[23:24] at this point rename 's/meta/ids/' *meta* would help too
[23:25] done
[23:25] ok
[23:25] so now we have archivebot-files-*.json and we need to convert them into archivebot-files-*.tsv
[23:26] for f in archivebot-files-*.json; do rq -r -f get_item_files.rq "${f}" > $(basename "${f}" .json).tsv; done
[23:26] that runs rq on all the files one by one
[23:27] sending the output to a .tsv file
[23:27] rq or jq
[23:27] jq
[23:27] :)
[23:27] weird that I would type rq twice
[23:28] are you in the uk db48x?
[23:29] *** GLaDOS has joined #internetarchive.bak
[23:30] sort of thing that happens when tired
[23:31] no, california
[23:31] yea, now we've got some real tsv files there
[23:31] awesome
[23:32] now you can run mkSHARD on one and see how it goes
[23:32] ../IA.BAK/mkSHARD archivebot-files-00.tsv
[23:33] *** GLaDOS has quit IRC (Client Quit)
[23:33] best shard ID?
[23:33] *** GLaDOS has joined #internetarchive.bak
[23:34] oh, uh
[23:34] I think we're up to 14 or 15 now?
[23:34] you guys should set up a wiki page to keep track
[23:34] Ok, I remember working on 14 as another one, so ill make this 14 instead, and then delete mine
[23:35] ok
[23:35] doh: -bash: bc: command not found
[23:35] ok, I installed bc
[23:36] these total sizes are interesting https://gist.github.com/yipdw/8c490feaa5a48273a99c827f62b793e7
[23:37] We may want to go smaller then
[23:38] 7TB is pushing it a little
[23:38] its a hard one - this is already 37 shards
[23:38] yea
[23:38] now we reap the cost of throwing all that shit in the bot
[23:38] heh
[23:38] :)
[23:39] on the other hand, 7TB is not out of the question
[23:39] the first shard is 8.7k files
[23:39] you might be able to split each of the > 4 TB shards down the middle and be fine
[23:40] like, I mean, literally just cut the TSV in half
[23:40] ArchiveBot WARCs are all pretty close to 5 GB each
[23:41] though I think the JSON puts all the PNGs and stuff at the end so maybe some rejiggering is useful
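(Editor's note: for reference, the conversion loop quoted at 23:26 with the rq/jq typo corrected, as the channel confirms it was meant to be jq; the filter file name get_item_files.jq is an assumption based on the typo'd get_item_files.rq.)

    # Run the jq filter over each item-files JSON dump, one .tsv per input.
    for f in archivebot-files-*.json; do
        jq -r -f get_item_files.jq "${f}" > "$(basename "${f}" .json).tsv"
    done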
[23:43] what is the issue with having such large shards?
[23:44] it just makes it harder for a user to grab the whole shard onto one disk
[23:44] I guess that's not such a huge issue with zfs/btrfs/lvm/whatever
[23:44] though it assumes that the majority of your storage servers use those technologies
[23:44] I do know that Jason called specifically for that sort of stuff (or at least "50 TB")
[23:45] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[23:45] gonna run a quick experiment
[23:45] Why dont we try "see what happens"
[23:45] *** GLaDOS has joined #internetarchive.bak
[23:45] it's easier to reconfigure shards now than it is when they're live
[23:45] I know this is critical though
[23:45] at this point, it's just text manipulation
[23:45] ^
[23:46] could divide the collection into slightly more shards
[23:46] divide by 75 instead of by 50 perhaps
[23:47] here's one thing i'm going to try
[23:47] cat archivebot-files-*.tsv | split-by-size-column-into-2TB-or-closest
[23:47] where that second script obviously exists
[23:48] that hits the ideal shard size and allows us to use the TSV data that we have right now
[23:48] the output of that pipeline is a new shard set
[23:49] db48x: can you get Ruby into the install on this machine
[23:49] i'm one of those people
[23:49] sure, go ahead and install it
[23:49] oh I thought we were doing this via propellor
[23:49] I can add that to the propellor config
[23:49] yipdw, its happening now for you
[23:50] yea, but we can't actually run propellor at the moment
[23:50] and its done
[23:50] so just install it and then modify propellor
[23:50] go ahead and add the bc package to propellor while you're there
[23:50] oh ok
[23:50] cool
[23:51] one sec, I'll finish this split experiment first
[23:52] I just had a thought
[23:52] split-collection blindly splits the list of files, and it doesn't try to keep all the files in an item in the same shard
[23:54] which isn't really a problem, but could annoy people
[23:55] /home/yipdw/archivebot/what for 2 TB shards
[23:56] 179 :P
[23:56] one second, computing sizes
[23:56] hm, do we want that many shards
[23:57] wait I fucked up
[23:57] sorry
[23:57] heh
[23:57] 10 ** 12 != 2 * (10 ** 12)
[23:57] you deleted all the files as I was computing their sizes :)
[23:57] or do you want me to use 2^40
[23:57] i'm gonna reignite all the holy wars
[23:58] base 2, obviously
[23:58] ok, 80 shards
[23:58] input.split.078.tsv: 2.001 GB
[23:58] input.split.079.tsv: 2.000 GB
[23:58] input.split.080.tsv: 2.008 GB
[23:58] 81
[23:58] wait what
[23:59] 2.000?
[23:59] or is that the thousands .
[23:59] oh, just a bug in my script
[23:59] so yeah, 81 shards if we do it at the 2 TB mark
[23:59] we can probably go for 3
[23:59] or 4
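(Editor's sketch: split-by-size-column-into-2TB-or-closest at 23:47 is a placeholder name. One way such a splitter could look in awk, assuming the per-file byte size is the second TSV column and a base-2 target of 2 TiB per chunk; both the column position and the target are assumptions, not facts from the repo.)

    # Greedily pack TSV rows into input.split.NNN.tsv chunks of ~2 TiB each,
    # starting a new chunk whenever adding a row would overshoot the target.
    cat archivebot-files-*.tsv | awk -F '\t' '
        BEGIN { target = 2 * 1024^4; n = 1; total = 0 }
        {
            if (total > 0 && total + $2 > target) { close(out); n++; total = 0 }
            total += $2
            out = sprintf("input.split.%03d.tsv", n)
            print > out
        }'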