[00:00] 54 shards at 3 TB
[00:00] shardsize() { echo "$1: $(echo "scale=3;$(cut -f 2 "$1" | paste -sd+ | bc)/1024/1024/1024/1024" | bc) TB"; }
[00:00] 39 shards at 4 TB
[00:00] 4 TiB
[00:00] er, 40 shards
[00:00] fucking zero base
[00:01] db48x: I've factored out a propellor iabak branch from joeyconfig. I can add your gpg key to it, which will let both of us access the private data. Do you have a gpg key?
[00:01] closure: I don't, actually
[00:02] yipdw: 40 shards of 4*2**40 sounds perfect, actually :)
[00:02] it is numerically pleasing
[00:02] closure: I have thus far contented myself with a large collection of ssh keys
[00:02] that sounds perfect, if you want to dump it back in my folder I can create the actual shards if you want
[00:03] yipdw: do you want to stop and fix the tiny bug both of our splitting scripts share?
[00:03] uh
[00:03] yes, because I do not know what this bug is
[00:03] it splits items across neighboring shards
[00:03] oh
[00:03] right
[00:04] git-annex won't care
[00:04] yeah that'd be good to fix
[00:05] closure: how do you recommend that I create a key?
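The bug described above (items split across neighboring shards) can be avoided by buffering one item at a time and only starting a new shard between items. The following is a minimal sketch, not the script used in the channel. It assumes a TSV layout the log does not confirm: column 1 is the item identifier, column 2 is the file size in bytes, and the input is already sorted so each item's rows are contiguous. The function name `split_by_item` and the `PREFIX-N.tsv` output naming are hypothetical.

```shell
# split_by_item MAX_BYTES PREFIX < sorted.tsv
# Assumed layout (not confirmed by the log): column 1 = item identifier,
# column 2 = file size in bytes, rows sorted so each item is contiguous.
split_by_item() {
  awk -F'\t' -v max="$1" -v prefix="$2" '
    # Write the buffered item to the current shard. If the whole item does
    # not fit in the space that is left, close the shard and start a new one.
    function flush_item() {
      if (n == 0) return
      if (used > 0 && used + item_bytes > max) { close(out); shard++; used = 0 }
      out = prefix "-" shard ".tsv"
      for (i = 1; i <= n; i++) print rows[i] > out
      used += item_bytes
      n = 0; item_bytes = 0
    }
    BEGIN { shard = 0 }                      # shards are numbered from zero
    $1 != current { flush_item(); current = $1 }
    { rows[++n] = $0; item_bytes += $2 }     # buffer the row, add its size
    END { flush_item() }
  '
}
```

An item that is larger than MAX_BYTES on its own still gets written, alone, into an oversized shard; the sketch does not try to split it.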
[00:05] db48x: yeah
[00:05] gpg --gen-key :P
[00:06] HCross: yeah one second, I'm going to modify this to respect item boundaries
[00:06] if I can
[00:07] I can probably do some looping stuff to generate the shards
[00:11] ok, I have generated a key
[00:16] one sec, I didn't account for items being present in two different input TSVs
[00:16] (it makes sense; split-collection doesn't account for it, so I should have seen it)
[00:18] actually, huh, there's duplicate files in the TSVs
[00:18] yipdw@iabak:~/archivebot$ cat archivebot-files-*.tsv | wc -l
[00:18] 224275
[00:18] yipdw@iabak:~/archivebot$ cat archivebot-files-*.tsv | sort | uniq | wc -l
[00:18] 173163
[00:19] perhaps it had gotten to a point where it ran out of room in a shard to put a full file
[00:19] no I mean the inputs have duplicates
[00:19] hrm
[00:19] same MD5, same size, same URL
[00:22] ok, if you eliminate the dupes and resplit you're down to 31 4 TB shards
[00:23] another one written by bwn that I had sitting around
[00:27] ok
[00:27] yipdw@iabak:~/archivebot$ cat archivebot-files-*.tsv | sort | uniq | ./sort-by-item > lol; cd what; cat ../lol | ../split-by-
[00:27] size-column-into-2TB-or-closest
[00:27] item boundaries ok, dupes removed, stupid filenames check
[00:27] i guess I didn't really need lol
[00:28] that's great
[00:28] HCross: if you'd like, you can copy from /home/yipdw/archivebot/what
[00:28] I'd also appreciate some checks on the files to make sure I didn't do anything stupid
[00:29] i have a check-item-boundaries script, but independent verification would be nice
[00:31] yipdw, thanks.
it's 00:30 here, so I am going to get some sleep, but I'll check it when I get up
[00:31] ok
[00:34] yipdw: spot checks look good
[00:35] ok
[00:36] I'm not sure any of these scripts I wrote have utility beyond one-offs for this shard, so I'll just keep them here for now
[00:36] yup, merged all the files and done a uniq -cd and it comes back with nothing
[00:36] if it turns out we get a "hey we have this tool" moment we can reconsider
[00:37] Cool, I'll start throwing out the shards in the morning
[00:37] we'll have to redo this process at some point, since the archivebot collection has grown since that snapshot
[00:37] but that's fine
[00:38] yipdw: no, there are lots of big shards to split up; go ahead and commit them
[00:38] once we know how, it will be so much better. This has been a huge learning process for us all
[00:38] db48x: ok
[00:42] can also use that jq script I just committed to first group the items of a collection into chunks and then use those to get the json metadata
[00:43] ah
[00:43] but it's better to have all the scripts committed so that others can come in and figure out how to make a shard
[00:44] (which is basically already a software archeology task, given the lack of documentation, but we've got to start somewhere)
[00:45] is the end goal something like
[00:45] ia search the-things-i-want | mkSHARD
[00:46] the end goal is ia | make-shards
[00:47] ok
[00:47] ;)
[00:47] ia | dwim
[00:48] ok, I must eat, and then I must pack for my journey
[00:51] I've documented on wiki /admin the new propellor setup
[01:03] HCross: there's a bug in my splitter script, you'll want to repull the shards
[01:03] *** antomatic has quit IRC (Ping timeout: 250 seconds)
[01:04] oh dear lord what have I missed
[01:04] HCross: shards regenerated in /home/yipdw/archivebot/what
[01:19] *** antomatic has joined #internetarchive.bak
[01:30] *** Somebody has joined #internetarchive.bak
[01:52] *** Somebody has quit IRC (Ping timeout: 370 seconds)
[02:40] *** Somebody has
joined #internetarchive.bak
[02:59] *** Somebody has quit IRC (Ping timeout: 370 seconds)
[03:00] *** Somebody has joined #internetarchive.bak
[03:34] *** Somebody has quit IRC (Ping timeout: 370 seconds)
[03:40] *** Somebody has joined #internetarchive.bak
[03:55] *** kyan has quit IRC (Quit: Leaving)
[04:18] *** Somebody1 has joined #internetarchive.bak
[04:21] *** Somebody has quit IRC (Ping timeout: 370 seconds)
[06:06] *** Somebody1 has quit IRC (Read error: Operation timed out)
[06:14] *** VADemon has quit IRC (Quit: left4dead)
[07:00] *** sep332 has joined #internetarchive.bak
[08:18] *** Aoede has quit IRC (Ping timeout: 260 seconds)
[08:23] *** Aoede has joined #internetarchive.bak
[09:31] *** jsp234 has joined #internetarchive.bak
[09:36] *** jsp12345 has quit IRC (Read error: Operation timed out)
[09:44] *** bwn has quit IRC (Read error: Operation timed out)
[09:45] *** bwn has joined #internetarchive.bak
[10:21] *** PurpleSym has quit IRC (*)
[10:22] *** PurpleSym has joined #internetarchive.bak
[12:03] *** atomotic has joined #internetarchive.bak
[12:59] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[13:15] Right, will start pushing out shards
[13:20] closure, db48x it seems that network on iabak. is broken, v6 gives me network is unreachable and DNS is painfully slow to resolve
[13:33] Anyone want to take Shard14 (the first ArchiveBot shard) out for a test run?
[13:39] SketchPho, first shard of archivebot is in!
[14:10] right
[14:10] let me throw some disk at it
[14:12] uh, are we missing a shard13?
[14:17] that was yours
[14:19] right
[14:19] looks like I need to work out how to make things work
[14:21] HCross: could you drop another one in as shard13 please? Would have a poke around but working at 4.
missed a lot of discussion yesterday
[14:22] *** joepie91 has joined #internetarchive.bak
[14:34] uh, hrm
[14:34] Broadcast message from systemd-journald@iabak (Sat 2016-11-12 02:23:55 EST):
[14:34]
[14:34] systemd[1]: Failed to run main loop: Bad address
[15:02] *** VADemon has joined #internetarchive.bak
[15:23] *** GLaDOS has quit IRC (Oh crap, I died.)
[15:51] db48x: added your gpg key to propellor
[16:03] ah, thanks
[16:35] well, see you folks later
[16:35] *** db48x has quit IRC (Remote host closed the connection)
[16:59] *** Somebody has joined #internetarchive.bak
[17:19] *** balrog has quit IRC (Read error: Operation timed out)
[17:20] *** balrog has joined #internetarchive.bak
[17:20] *** svchfoo3 sets mode: +o balrog
[17:24] *** Somebody has quit IRC (Ping timeout: 370 seconds)
[18:06] *** kyan has joined #internetarchive.bak
[19:12] *** Somebody has joined #internetarchive.bak
[19:30] *** kyan has quit IRC (Remote host closed the connection)
[20:26] *** kyan has joined #internetarchive.bak
[20:40] *** Start has quit IRC (Quit: Disconnected.)
[21:58] Hurrah
[22:02] I'm slowly adding archivebot, bit by bit
[22:31] *** Start has joined #internetarchive.bak
[23:17] *** Somebody has quit IRC (Ping timeout: 370 seconds)
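For anyone redoing the shard verification discussed between 00:18 and 00:36, the two checks (duplicate rows in the input TSVs, and items spanning more than one shard) could be sketched roughly as below. This is not the check-item-boundaries script mentioned in the log, which was never pasted. The function names are hypothetical, and the sketch assumes that column 1 of each TSV is the item identifier.

```shell
# dedupe_tsv FILE...: merge TSVs and drop rows identical in every column,
# equivalent in effect to the `sort | uniq` pipeline pasted in the channel.
dedupe_tsv() {
  LC_ALL=C sort -u "$@"
}

# items_in_multiple_shards FILE...: print every item (assumed: column 1)
# that appears in more than one shard file. No output means that no item
# was split across shards.
items_in_multiple_shards() {
  awk -F'\t' '
    # Count each (item, file) pair once, however many rows the item has.
    !(($1, FILENAME) in seen) { seen[$1, FILENAME] = 1; shards[$1]++ }
    END { for (item in shards) if (shards[item] > 1) print item }
  ' "$@" | LC_ALL=C sort
}
```

A run such as `items_in_multiple_shards what/*` (directory name taken from the log, file naming assumed) printing nothing would be the independent verification asked for at 00:29.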