#internetarchive.bak 2016-11-12,Sat

Time Nickname Message
00:00 🔗 yipdw 54 shards at 3 TB
00:00 🔗 db48x shardsize() { echo "$1: $(echo "scale=3;$(cut -f 2 "$1" | paste -sd+ | bc)/1024/1024/1024/1024" | bc) TB"; }
00:00 🔗 yipdw 39 shards at 4 TB
00:00 🔗 yipdw 4 TiB
00:00 🔗 yipdw er, 40 shards
00:00 🔗 yipdw fucking zero base
00:01 🔗 closure db48x: I've factored out a propellor iabak branch from joeyconfig. I can add your gpg key to it, which will let both of us access the private data. Do you have a gpg key?
00:01 🔗 db48x closure: I don't, actually
00:02 🔗 db48x yipdw: 40 shards of 4*2**40 sounds perfect, actually :)
00:02 🔗 yipdw it is numerically pleasing
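
The size arithmetic above can be sketched in a single awk pass. This is only an illustration: it assumes a tab-separated shard list whose second column is a file size in bytes (the column `shardsize` reads with `cut -f 2`), and the name `shard_tib` is made up here. Dividing by 1024 four times gives TiB, not TB.

```shell
# Sum column 2 (assumed: file size in bytes) of a TSV and print the total in TiB.
# "shard_tib" is a hypothetical name, not a script from the channel.
shard_tib() {
    awk -F '\t' '{ total += $2 } END { printf "%.3f TiB\n", total / (1024 ^ 4) }' "$1"
}
```

Calling `shard_tib some-shard.tsv` prints one line with the total. A 4 TiB shard limit is 4 * 2^40 = 4398046511104 bytes.
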
00:02 🔗 db48x closure: I have thus far contented myself with a large collection of ssh keys
00:02 🔗 HCross that sounds perfect, if you want to dump it back in my folder I can create the actual shards if you want
00:03 🔗 db48x yipdw: do you want to stop and fix the tiny bug both of our splitting scripts share?
00:03 🔗 yipdw uh
00:03 🔗 yipdw yes, because I do not know what this bug is
00:03 🔗 db48x it splits items across neighboring shards
00:03 🔗 yipdw oh
00:03 🔗 yipdw right
00:04 🔗 db48x git-annex won't care
00:04 🔗 yipdw yeah that'd be good to fix
00:05 🔗 db48x closure: how do you recommend that I create a key?
00:05 🔗 closure db48x: yeah
00:05 🔗 closure gpg --gen-key :P
00:06 🔗 yipdw HCross: yeah one second, I'm going to modify this to respect item boundaries
00:06 🔗 yipdw if I can
00:07 🔗 HCross I can probably do some looping stuff to generate the shards
00:11 🔗 db48x ok, I have generated a key
00:16 🔗 yipdw one sec, I didn't account for items being present in two different input TSVs
00:16 🔗 yipdw (it makes sense; split-collection doesn't account for it, so I should have seen it)
00:18 🔗 yipdw actually, huh, there's duplicate files in the TSVs
00:18 🔗 yipdw yipdw@iabak:~/archivebot$ cat archivebot-files-*.tsv | wc -l
00:18 🔗 yipdw 224275
00:18 🔗 yipdw yipdw@iabak:~/archivebot$ cat archivebot-files-*.tsv | sort | uniq | wc -l
00:18 🔗 yipdw 173163
00:19 🔗 HCross perhaps it had gotten to a point where it ran out of room in a shard to put a full file
00:19 🔗 yipdw no I mean the inputs have duplicates
00:19 🔗 db48x hrm
00:19 🔗 yipdw same MD5, same size, same URL
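
The two `wc -l` counts above differ by 224275 - 173163 = 51112 lines that exactly repeat an earlier line. A minimal sketch that reports the same three numbers in one pass follows. The function name `count_dupes` is hypothetical, and lines are compared as whole strings, which matches what `sort | uniq` does.

```shell
# Print total, distinct, and surplus duplicate line counts for the given files.
# Whole lines are compared, so two rows count as duplicates only if every
# column is identical.
count_dupes() {
    cat "$@" | awk '
        { if (!($0 in seen)) distinct++; seen[$0] = 1 }
        END { printf "total=%d distinct=%d duplicates=%d\n", NR, distinct, NR - distinct }
    '
}
```
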
00:22 🔗 yipdw ok, if you eliminate the dupes and resplit you're down to 31 4 TB shards
00:23 🔗 db48x another one written by bwn that I had sitting around
00:27 🔗 yipdw ok
00:27 🔗 yipdw yipdw@iabak:~/archivebot$ cat archivebot-files-*.tsv | sort | uniq | ./sort-by-item > lol; cd what; cat ../lol | ../split-by-size-column-into-2TB-or-closest
00:27 🔗 yipdw item boundaries ok, dupes removed, stupid filenames check
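
The actual splitter scripts are not shown in the log. What follows is only a sketch of the idea of splitting by size while respecting item boundaries. Assumptions that do not come from the log: the input is already sorted by item, column 1 has the form `item/filename`, and column 2 is the size in bytes. The name `split_by_item` and the output naming scheme are made up.

```shell
# Split a TSV into shard files of at most LIMIT bytes without splitting an item.
# Assumed layout (hypothetical): column 1 = "item/filename", column 2 = bytes,
# input sorted so that all rows of one item are adjacent.
split_by_item() {
    # $1 = input TSV, $2 = byte limit per shard, $3 = output path prefix
    awk -F '\t' -v limit="$2" -v prefix="$3" '
        function emit_item(   i) {
            if (n == 0) return
            # Open a new shard if this whole item would overflow the current one.
            if (shard_size > 0 && shard_size + item_size > limit) {
                close(out)
                shard++
                shard_size = 0
            }
            out = sprintf("%s%03d.tsv", prefix, shard)
            for (i = 1; i <= n; i++) print lines[i] > out
            shard_size += item_size
            n = 0
            item_size = 0
        }
        {
            item = $1
            sub(/\/.*/, "", item)            # keep the text before the first "/"
            if (item != current) emit_item() # previous item is complete
            current = item
            lines[++n] = $0
            item_size += $2
        }
        END { emit_item() }
    ' "$1"
}
```

Rows are buffered per item and written only once the item is complete, so an item never straddles two shards. An item larger than the limit ends up alone in an oversized shard rather than being cut.
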
00:27 🔗 yipdw i guess I didn't really need lol
00:28 🔗 db48x that's great
00:28 🔗 yipdw HCross: if you'd like, you can copy from /home/yipdw/archivebot/what
00:28 🔗 yipdw I'd also appreciate some checks on the files to make sure I didn't do anything stupid
00:29 🔗 yipdw i have a check-item-boundaries script, but independent verification would be nice
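
yipdw's check-item-boundaries script is not in the log. An independent check along the same lines might look like the sketch below. It makes the same hypothetical assumption that column 1 is `item/filename`, and the function name is made up.

```shell
# Report every item that appears in more than one shard file.
# Exit status is 0 when no item crosses a shard boundary, 1 otherwise.
# Assumed layout (hypothetical): column 1 = "item/filename".
check_item_boundaries() {
    awk -F '\t' '
        {
            item = $1
            sub(/\/.*/, "", item)
            if ((item in seen) && seen[item] != FILENAME) {
                if (!(item in reported))
                    print item ": " seen[item] " and " FILENAME
                reported[item] = 1
                bad = 1
            } else {
                seen[item] = FILENAME
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$@"
}
```

It would be run over all shard files at once, for example `check_item_boundaries shard*.tsv`.
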
00:31 🔗 HCross yipdw, thanks. it's 00:30 here, so I am going to get some sleep, but I'll check it when I get up
00:31 🔗 yipdw ok
00:34 🔗 db48x yipdw: spot checks look good
00:35 🔗 yipdw ok
00:36 🔗 yipdw I'm not sure any of these scripts I wrote have utility beyond one-offs for this shard, so I'll just keep them here for now
00:36 🔗 HCross yup, merged all the files and done a uniq -cd and it comes back with nothing
00:36 🔗 yipdw if it turns out we get a "hey we have this tool" moment we can reconsider
00:37 🔗 HCross Cool, I'll start throwing out the shards in the morning
00:37 🔗 yipdw we'll have to redo this process at some point, since the archivebot collection has grown since that snapshot
00:37 🔗 yipdw but that's fine
00:38 🔗 db48x yipdw: no, there are lots of big shards to split up; go ahead and commit them
00:38 🔗 HCross once we know how, it will be so much better. This has been a huge learning process for us all
00:38 🔗 yipdw db48x: ok
00:42 🔗 db48x can also use that jq script I just committed to first group the items of a collection into chunks and then use those to get the json metadata
00:43 🔗 yipdw ah
00:43 🔗 db48x but it's better to have all the scripts committed so that others can come in and figure out how to make a shard
00:44 🔗 db48x (which is basically already a software archeology task, given the lack of documentation, but we've got to start somewhere)
00:45 🔗 yipdw is the end goal something like
00:45 🔗 yipdw ia search the-things-i-want | mkSHARD
00:46 🔗 db48x the end goal is ia | make-shards
00:47 🔗 yipdw ok
00:47 🔗 db48x ;)
00:47 🔗 db48x ia | dwim
00:48 🔗 db48x ok, I must eat, and then I must pack for my journey
00:51 🔗 closure I've documented on wiki /admin the new propellor setup
01:03 🔗 yipdw HCross: there's a bug in my splitter script, you'll want to repull the shards
01:03 🔗 antomatic has quit IRC (Ping timeout: 250 seconds)
01:04 🔗 Kaz oh dear lord what have I missed
01:04 🔗 yipdw HCross: shards regenerated in /home/yipdw/archivebot/what
01:19 🔗 antomatic has joined #internetarchive.bak
01:30 🔗 Somebody has joined #internetarchive.bak
01:52 🔗 Somebody has quit IRC (Ping timeout: 370 seconds)
02:40 🔗 Somebody has joined #internetarchive.bak
02:59 🔗 Somebody has quit IRC (Ping timeout: 370 seconds)
03:00 🔗 Somebody has joined #internetarchive.bak
03:34 🔗 Somebody has quit IRC (Ping timeout: 370 seconds)
03:40 🔗 Somebody has joined #internetarchive.bak
03:55 🔗 kyan has quit IRC (Quit: Leaving)
04:18 🔗 Somebody1 has joined #internetarchive.bak
04:21 🔗 Somebody has quit IRC (Ping timeout: 370 seconds)
06:06 🔗 Somebody1 has quit IRC (Read error: Operation timed out)
06:14 🔗 VADemon has quit IRC (Quit: left4dead)
07:00 🔗 sep332 has joined #internetarchive.bak
08:18 🔗 Aoede has quit IRC (Ping timeout: 260 seconds)
08:23 🔗 Aoede has joined #internetarchive.bak
09:31 🔗 jsp234 has joined #internetarchive.bak
09:36 🔗 jsp12345 has quit IRC (Read error: Operation timed out)
09:44 🔗 bwn has quit IRC (Read error: Operation timed out)
09:45 🔗 bwn has joined #internetarchive.bak
10:21 🔗 PurpleSym has quit IRC (*)
10:22 🔗 PurpleSym has joined #internetarchive.bak
12:03 🔗 atomotic has joined #internetarchive.bak
12:59 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
13:15 🔗 HCross Right, will start pushing out shards
13:20 🔗 HCross closure, db48x it seems that network on iabak. is broken, v6 gives me network is unreachable and DNS is painfully slow to resolve
13:33 🔗 HCross Anyone want to take Shard14 (the first ArchiveBot shard) out for a test run?
13:39 🔗 HCross SketchPho, first shard of archivebot is in!
14:10 🔗 Kaz right
14:10 🔗 Kaz let me throw some disk at it
14:12 🔗 Kaz uh, are we missing a shard13?
14:17 🔗 HCross that was yours
14:19 🔗 Kaz right
14:19 🔗 Kaz looks like I need to work out how to make things work
14:21 🔗 Kaz HCross: could you drop another one in as shard13 please? Would have a poke around but working at 4. missed a lot of discussion yesterday
14:22 🔗 joepie91 has joined #internetarchive.bak
14:34 🔗 db48x uh, hrm
14:34 🔗 db48x Broadcast message from systemd-journald@iabak (Sat 2016-11-12 02:23:55 EST):
14:34 🔗 db48x
14:34 🔗 db48x systemd[1]: Failed to run main loop: Bad address
15:02 🔗 VADemon has joined #internetarchive.bak
15:23 🔗 GLaDOS has quit IRC (Oh crap, I died.)
15:51 🔗 closure db48x: added your gpg key to propellor
16:03 🔗 db48x ah, thanks
16:35 🔗 db48x well, see you folks later
16:35 🔗 db48x has quit IRC (Remote host closed the connection)
16:59 🔗 Somebody has joined #internetarchive.bak
17:19 🔗 balrog has quit IRC (Read error: Operation timed out)
17:20 🔗 balrog has joined #internetarchive.bak
17:20 🔗 svchfoo3 sets mode: +o balrog
17:24 🔗 Somebody has quit IRC (Ping timeout: 370 seconds)
18:06 🔗 kyan has joined #internetarchive.bak
19:12 🔗 Somebody has joined #internetarchive.bak
19:30 🔗 kyan has quit IRC (Remote host closed the connection)
20:26 🔗 kyan has joined #internetarchive.bak
20:40 🔗 Start has quit IRC (Quit: Disconnected.)
21:58 🔗 SketchPho Hurrah
22:02 🔗 HCross I'm slowly adding archivebot, bit by bit
22:31 🔗 Start has joined #internetarchive.bak
23:17 🔗 Somebody has quit IRC (Ping timeout: 370 seconds)
