Time |
Nickname |
Message |
00:00
🔗
|
yipdw |
54 shards at 3 TB |
00:00
🔗
|
db48x |
shardsize() { echo "$f: $(echo "scale=3;$(cut -f 2 $1 | paste -sd+ | bc)/1024/1024/1024/1024" | bc) TB"; } |
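A rough Python rendering of the one-liner above, as a sketch only. It assumes, based on `cut -f 2`, that the second tab-separated field of each line is a size in bytes. Dividing by 1024 four times gives TiB, although the one-liner labels its output TB. The name `shard_size_tib` and the sample rows are hypothetical.

```python
def shard_size_tib(lines):
    """Sum the second tab-separated field of each line; return the total in TiB."""
    total_bytes = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # skip blank or malformed lines (an assumption of this sketch)
        total_bytes += int(fields[1])
    return total_bytes / 1024**4


# Hypothetical rows; only the position of the size field is taken from the log.
sample = [
    "item-a/file1.warc.gz\t1099511627776\tabc\n",  # 1 TiB
    "item-a/file2.warc.gz\t3298534883328\tdef\n",  # 3 TiB
]
print(f"{shard_size_tib(sample):.3f} TiB")
```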
00:00
🔗
|
yipdw |
39 shards at 4 TB |
00:00
🔗
|
yipdw |
4 TiB |
00:00
🔗
|
yipdw |
er, 40 shards |
00:00
🔗
|
yipdw |
fucking zero base |
00:01
🔗
|
closure |
db48x: I've factored out a propellor iabak branch from joeyconfig. I can add your gpg key to it, which will let both of us access the private data. Do you have a gpg key? |
00:01
🔗
|
db48x |
closure: I don't, actually |
00:02
🔗
|
db48x |
yipdw: 40 shards of 4*2**40 sounds perfect, actually :) |
00:02
🔗
|
yipdw |
it is numerically pleasing |
00:02
🔗
|
db48x |
closure: I have thus far contented myself with a large collection of ssh keys |
00:02
🔗
|
HCross |
that sounds perfect, if you want to dump it back in my folder I can create the actual shards if you want |
00:03
🔗
|
db48x |
yipdw: do you want to stop and fix the tiny bug both of our splitting scripts share? |
00:03
🔗
|
yipdw |
uh |
00:03
🔗
|
yipdw |
yes, because I do not know what this bug is |
00:03
🔗
|
db48x |
it splits items across neighboring shards |
00:03
🔗
|
yipdw |
oh |
00:03
🔗
|
yipdw |
right |
00:04
🔗
|
db48x |
git-annex won't care |
00:04
🔗
|
yipdw |
yeah that'd be good to fix |
00:05
🔗
|
db48x |
closure: how do you recommend that I create a key? |
00:05
🔗
|
closure |
db48x: yeah |
00:05
🔗
|
closure |
gpg --gen-key :P |
00:06
🔗
|
yipdw |
HCross: yeah one second, I'm going to modify this to respect item boundaries |
00:06
🔗
|
yipdw |
if I can |
00:07
🔗
|
HCross |
I can probably do some looping stuff to generate the shards |
00:11
🔗
|
db48x |
ok, I have generated a key |
00:16
🔗
|
yipdw |
one sec, I didn't account for items being present in two different input TSVs |
00:16
🔗
|
yipdw |
(it makes sense; split-collection doesn't account for it, so I should have seen it) |
00:18
🔗
|
yipdw |
actually, huh, there's duplicate files in the TSVs |
00:18
🔗
|
yipdw |
yipdw@iabak:~/archivebot$ cat archivebot-files-*.tsv | wc -l |
00:18
🔗
|
yipdw |
224275 |
00:18
🔗
|
yipdw |
yipdw@iabak:~/archivebot$ cat archivebot-files-*.tsv | sort | uniq | wc -l |
00:18
🔗
|
yipdw |
173163 |
00:19
🔗
|
HCross |
perhaps it had gotten to a point where it ran out of room in a shard to put a full file |
00:19
🔗
|
yipdw |
no I mean the inputs have duplicates |
00:19
🔗
|
db48x |
hrm |
00:19
🔗
|
yipdw |
same MD5, same size, same URL |
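The `sort | uniq` pipeline above counts two rows as duplicates only when the whole line matches, which fits the "same MD5, same size, same URL" observation (224275 lines before, 173163 after). A minimal Python sketch of the same idea follows. The function name and sample rows are hypothetical, and Python's string ordering may differ from a locale-aware `sort`.

```python
def dedupe_lines(lines):
    """Return the distinct lines in sorted order, similar to `sort | uniq`."""
    return sorted(set(line.rstrip("\n") for line in lines))


# Hypothetical rows; the real column layout is not shown in the log.
rows = [
    "b.warc.gz\t200\tmd5-b",
    "a.warc.gz\t100\tmd5-a",
    "b.warc.gz\t200\tmd5-b",  # exact duplicate of the first row
]
unique_rows = dedupe_lines(rows)
print(len(rows), "->", len(unique_rows))
```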
00:22
🔗
|
yipdw |
ok, if you eliminate the dupes and resplit you're down to 31 4 TB shards |
00:23
🔗
|
db48x |
another one written by bwn that I had sitting around |
00:27
🔗
|
yipdw |
ok |
00:27
🔗
|
yipdw |
yipdw@iabak:~/archivebot$ cat archivebot-files-*.tsv | sort | uniq | ./sort-by-item > lol; cd what; cat ../lol | ../split-by-size-column-into-2TB-or-closest |
00:27
🔗
|
yipdw |
item boundaries ok, dupes removed, stupid filenames check |
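One way to split on item boundaries, sketched in Python under stated assumptions. It does not claim to reproduce yipdw's scripts, whose contents are not shown in the log. Rows are assumed to be `(item, size_in_bytes)` pairs already sorted so that all rows of an item are adjacent. The packing is a simple greedy pass, and an item larger than the limit ends up alone in an oversized shard. The name `split_by_item` and the sample data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def split_by_item(rows, limit_bytes):
    """Greedily pack (item, size) rows into shards without splitting any item."""
    shards = []        # finished shards, each a list of rows
    current = []       # rows of the shard being filled
    current_size = 0
    for _item, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        item_size = sum(size for _, size in group)
        # Close the current shard if the whole item would push it over the limit.
        if current and current_size + item_size > limit_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.extend(group)
        current_size += item_size
    if current:
        shards.append(current)
    return shards


# Hypothetical data with a tiny limit so the behaviour is easy to follow.
rows = [("a", 3), ("a", 3), ("b", 5), ("c", 2), ("c", 2)]
shards = split_by_item(rows, limit_bytes=10)
for number, shard in enumerate(shards):
    print(number, shard)
```

With these inputs, item "b" does not fit next to "a" (6 + 5 > 10), so it opens the second shard, and "c" joins it.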
00:27
🔗
|
yipdw |
i guess I didn't really need lol |
00:28
🔗
|
db48x |
that's great |
00:28
🔗
|
yipdw |
HCross: if you'd like, you can copy from /home/yipdw/archivebot/what |
00:28
🔗
|
yipdw |
I'd also appreciate some checks on the files to make sure I didn't do anything stupid |
00:29
🔗
|
yipdw |
i have a check-item-boundaries script, but independent verification would be nice |
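An independent check of item boundaries could look roughly like the Python sketch below. This is not the `check-item-boundaries` script mentioned above, which is not shown in the log. Each shard is assumed to be a list of item identifiers, one per file row, and the function name is hypothetical.

```python
def items_in_multiple_shards(shards):
    """Return {item: sorted shard indexes} for every item found in more than one shard."""
    seen = {}
    for index, shard in enumerate(shards):
        for item in shard:
            seen.setdefault(item, set()).add(index)
    return {item: sorted(ix) for item, ix in seen.items() if len(ix) > 1}


# Hypothetical shards: "good" respects item boundaries, "bad" splits item "b".
good = [["a", "a", "b"], ["c"]]
bad = [["a", "b"], ["b", "c"]]
print(items_in_multiple_shards(good))
print(items_in_multiple_shards(bad))
```

An empty result means no item straddles two shards.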
00:31
🔗
|
HCross |
yipdw, thanks. it's 00:30 here, so I am going to get some sleep, but I'll check it when I get up |
00:31
🔗
|
yipdw |
ok |
00:34
🔗
|
db48x |
yipdw: spot checks look good |
00:35
🔗
|
yipdw |
ok |
00:36
🔗
|
yipdw |
I'm not sure any of these scripts I wrote have utility beyond one-offs for this shard, so I'll just keep them here for now |
00:36
🔗
|
HCross |
yup, merged all the files and done a uniq -cd and it comes back with nothing |
00:36
🔗
|
yipdw |
if it turns out we get a "hey we have this tool" moment we can reconsider |
00:37
🔗
|
HCross |
Cool, I'll start throwing out the shards in the morning |
00:37
🔗
|
yipdw |
we'll have to redo this process at some point, since the archivebot collection has grown since that snapshot |
00:37
🔗
|
yipdw |
but that's fine |
00:38
🔗
|
db48x |
yipdw: no, there are lots of big shards to split up; go ahead and commit them |
00:38
🔗
|
HCross |
once we know how, it will be so much better. This has been a huge learning process for us all |
00:38
🔗
|
yipdw |
db48x: ok |
00:42
🔗
|
db48x |
can also use that jq script I just committed to first group the items of a collection into chunks and then use those to get the json metadata |
00:43
🔗
|
yipdw |
ah |
00:43
🔗
|
db48x |
but it's better to have all the scripts committed so that others can come in and figure out how to make a shard |
00:44
🔗
|
db48x |
(which is basically already a software archeology task, given the lack of documentation, but we've got to start somewhere) |
00:45
🔗
|
yipdw |
is the end goal something like |
00:45
🔗
|
yipdw |
ia search the-things-i-want | mkSHARD |
00:46
🔗
|
db48x |
the end goal is ia | make-shards |
00:47
🔗
|
yipdw |
ok |
00:47
🔗
|
db48x |
;) |
00:47
🔗
|
db48x |
ia | dwim |
00:48
🔗
|
db48x |
ok, I must eat, and then I must pack for my journey |
00:51
🔗
|
closure |
I've documented on wiki /admin the new propellor setup |
01:03
🔗
|
yipdw |
HCross: there's a bug in my splitter script, you'll want to repull the shards |
01:03
🔗
|
|
antomatic has quit IRC (Ping timeout: 250 seconds) |
01:04
🔗
|
Kaz |
oh dear lord what have I missed |
01:04
🔗
|
yipdw |
HCross: shards regenerated in /home/yipdw/archivebot/what |
01:19
🔗
|
|
antomatic has joined #internetarchive.bak |
01:30
🔗
|
|
Somebody has joined #internetarchive.bak |
01:52
🔗
|
|
Somebody has quit IRC (Ping timeout: 370 seconds) |
02:40
🔗
|
|
Somebody has joined #internetarchive.bak |
02:59
🔗
|
|
Somebody has quit IRC (Ping timeout: 370 seconds) |
03:00
🔗
|
|
Somebody has joined #internetarchive.bak |
03:34
🔗
|
|
Somebody has quit IRC (Ping timeout: 370 seconds) |
03:40
🔗
|
|
Somebody has joined #internetarchive.bak |
03:55
🔗
|
|
kyan has quit IRC (Quit: Leaving) |
04:18
🔗
|
|
Somebody1 has joined #internetarchive.bak |
04:21
🔗
|
|
Somebody has quit IRC (Ping timeout: 370 seconds) |
06:06
🔗
|
|
Somebody1 has quit IRC (Read error: Operation timed out) |
06:14
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
07:00
🔗
|
|
sep332 has joined #internetarchive.bak |
08:18
🔗
|
|
Aoede has quit IRC (Ping timeout: 260 seconds) |
08:23
🔗
|
|
Aoede has joined #internetarchive.bak |
09:31
🔗
|
|
jsp234 has joined #internetarchive.bak |
09:36
🔗
|
|
jsp12345 has quit IRC (Read error: Operation timed out) |
09:44
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
09:45
🔗
|
|
bwn has joined #internetarchive.bak |
10:21
🔗
|
|
PurpleSym has quit IRC (*) |
10:22
🔗
|
|
PurpleSym has joined #internetarchive.bak |
12:03
🔗
|
|
atomotic has joined #internetarchive.bak |
12:59
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
13:15
🔗
|
HCross |
Right, will start pushing out shards |
13:20
🔗
|
HCross |
closure, db48x it seems that network on iabak. is broken, v6 gives me network is unreachable and DNS is painfully slow to resolve |
13:33
🔗
|
HCross |
Anyone want to take Shard14 (the first ArchiveBot shard) out for a test run? |
13:39
🔗
|
HCross |
SketchPho, first shard of archivebot is in! |
14:10
🔗
|
Kaz |
right |
14:10
🔗
|
Kaz |
let me throw some disk at it |
14:12
🔗
|
Kaz |
uh, are we missing a shard13? |
14:17
🔗
|
HCross |
that was yours |
14:19
🔗
|
Kaz |
right |
14:19
🔗
|
Kaz |
looks like I need to work out how to make things work |
14:21
🔗
|
Kaz |
HCross: could you drop another one in as shard13 please? Would have a poke around but working at 4. missed a lot of discussion yesterday |
14:22
🔗
|
|
joepie91 has joined #internetarchive.bak |
14:34
🔗
|
db48x |
uh, hrm |
14:34
🔗
|
db48x |
Broadcast message from systemd-journald@iabak (Sat 2016-11-12 02:23:55 EST): |
14:34
🔗
|
db48x |
|
14:34
🔗
|
db48x |
systemd[1]: Failed to run main loop: Bad address |
15:02
🔗
|
|
VADemon has joined #internetarchive.bak |
15:23
🔗
|
|
GLaDOS has quit IRC (Oh crap, I died.) |
15:51
🔗
|
closure |
db48x: added your gpg key to propellor |
16:03
🔗
|
db48x |
ah, thanks |
16:35
🔗
|
db48x |
well, see you folks later |
16:35
🔗
|
|
db48x has quit IRC (Remote host closed the connection) |
16:59
🔗
|
|
Somebody has joined #internetarchive.bak |
17:19
🔗
|
|
balrog has quit IRC (Read error: Operation timed out) |
17:20
🔗
|
|
balrog has joined #internetarchive.bak |
17:20
🔗
|
|
svchfoo3 sets mode: +o balrog |
17:24
🔗
|
|
Somebody has quit IRC (Ping timeout: 370 seconds) |
18:06
🔗
|
|
kyan has joined #internetarchive.bak |
19:12
🔗
|
|
Somebody has joined #internetarchive.bak |
19:30
🔗
|
|
kyan has quit IRC (Remote host closed the connection) |
20:26
🔗
|
|
kyan has joined #internetarchive.bak |
20:40
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
21:58
🔗
|
SketchPho |
Hurrah |
22:02
🔗
|
HCross |
I'm slowly adding archivebot, bit by bit |
22:31
🔗
|
|
Start has joined #internetarchive.bak |
23:17
🔗
|
|
Somebody has quit IRC (Ping timeout: 370 seconds) |