[00:50] <db48x> the other thing we should do is decide on what sort of indirection we should use to split things off onto multiple servers, stand up another VM and split the load between them
[01:10] <db48x> identical servers with round-robin dns would be simple, and I can't think of any reason why it wouldn't be effective
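A rough sketch of why round-robin DNS spreads load across identical servers: the resolver rotates the order of the address records on each response, and most clients connect to the first one. This is a local simulation only, not a real DNS query; the addresses come from the reserved documentation range and the function name is hypothetical.

```python
from collections import Counter, deque

def round_robin_answers(addresses, queries):
    """Yield the record list for each query, rotating the order each time."""
    records = deque(addresses)
    for _ in range(queries):
        yield list(records)
        records.rotate(-1)  # the next response starts with the next server

# Hypothetical addresses for two identical servers.
servers = ["192.0.2.10", "192.0.2.11"]

# Assume each client simply connects to the first address it receives.
picks = Counter(answer[0] for answer in round_robin_answers(servers, 10))
print(picks)
```

With two servers and ten lookups, each server is picked five times, which is the even split the suggestion relies on.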
[01:10] <robogoat> I can try to bring some more storage online.
[01:11] <robogoat> But aren't there mirrorings of the archive in places?
[01:11] <robogoat> Some universities, multiple countries?
[01:12] <db48x> IA has its own redundancies, yes
[01:13] <db48x> some universities probably mirror specific collections which they helped create or curate
[01:14] <robogoat> So is the concern more about archiveteam stuff? 
[01:16] <db48x> that's part of it
[01:16] <db48x> archiveteam relies quite a lot on being able to upload things to IA
[01:17] <db48x> none of us have the amount of cash it would take to duplicate that infrastructure
[01:18] <robogoat> Yeah, we need an archiveteam lottery winner :p
[01:18] <db48x> ah, that's a good idea
[01:29] <robogoat> So it's not that there's likely an existential threat to the archive, but some specific datasets?
[01:30] <db48x> there's no known existential threat to the Archive, or to any specific items or collections stored by the Archive
[01:31] <db48x> I mean, aside from copyright
[01:31] <db48x> of course, I'm not really an insider
[01:31] <robogoat> < SketchCow> I think we're a target (IA) again
[01:32] <robogoat> < SketchCow> I think they've gotten a hang of things, and we'll be gone after
[01:32] <db48x> actually, I was just thinking about the Amiga collection on IA
[01:32] <robogoat> Those are some serious words.
[01:33] <db48x> one person claims to own everything Amiga-related, and a lot of things are dark because of it
[01:33] <robogoat> Yeah, darking things is complicated.
[01:34] <db48x> so maybe there are threats to collections, but I think they're all copyright-based
[01:35] <db48x> and I can't see how that would be worse now than it was last year
[01:35] <db48x> maybe SketchCow will enlighten us
[01:35] <robogoat> I have always worried about the IA from a security perspective.
[01:36] <db48x> how so?
[01:37] <robogoat> Depending on the motives, determination, and scruples of the adversary, it might be easier to compromise IA and remove the data that way.
[01:39] <db48x> I suppose it's possible
[01:40] <db48x> though it seems quite unlikely that an ordinary copyright-holder would go to such lengths, when a DMCA takedown is so easy
[01:40] <robogoat> If, for example, people felt like the IA were archiving data that had evidence that would counter statements made by an interest.
[01:41] <robogoat> I guess, why do you think that copyright-holders are the only ones who would have an interest in preventing access to IA information?
[01:42] <robogoat> We just need Amazon to donate one of these: https://aws.amazon.com/snowmobile/
[01:42] <db48x> mmm, that's something I hadn't thought of
[01:43] <db48x> heh, that'd be fun to play with
[01:43] <db48x> even Glacier is quite expensive, though
[01:44] <robogoat> ?!?
[01:44] <robogoat> $0.004 per GB / Month is expensive?
[01:44] <robogoat> So that's $4/TB/mo.
[01:44] <db48x> $4000/PB/mo
[01:45] <db48x> personally, that's pretty expensive
[01:45] <robogoat> Personally, absolutely.
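The storage arithmetic quoted above can be checked with a short sketch. It assumes decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) and the $0.004 per GB per month rate mentioned in the channel; the function name is made up for illustration, and retrieval or transfer fees are not modeled.

```python
# Rate quoted in the channel: $0.004 per GB per month (storage only).
RATE_PER_GB_MONTH = 0.004

GB_PER_TB = 1000             # assumption: decimal units
GB_PER_PB = 1000 * GB_PER_TB

def monthly_storage_cost(gigabytes, rate_per_gb=RATE_PER_GB_MONTH):
    """Return the monthly storage cost in dollars for a size given in GB."""
    return gigabytes * rate_per_gb

print(f"1 TB: ${monthly_storage_cost(GB_PER_TB):,.2f}/month")
print(f"1 PB: ${monthly_storage_cost(GB_PER_PB):,.2f}/month")
```

Under these assumptions the figures match the ones given in the conversation: $4 per TB per month and $4000 per PB per month.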
[01:46] <db48x> and we could rent a room somewhere and put hard drives in it for a lot less
[01:46] <robogoat> Mmmm, interesting. Could you?
[01:46] <robogoat> Amortized out over time you could.
[01:46] <robogoat> But by how much?
[01:47] <db48x> unknown
[01:47] <robogoat> Why is it unknown?
[01:47] <robogoat> What variable is missing?
[01:47] <db48x> because I've never actually sat down and priced everything out
[01:48] <robogoat> Yeah, I mean, what you're pricing depends on a lot of variables.
[01:48] <db48x> it seems obvious that a climate-controlled room containing a PB worth of unpowered disks would be cheaper
[01:48] <db48x> I'm not so sure about a PB of powered up disks, ready to serve all the data
[01:49] <robogoat> Right, and storing unpowered disks is slightly more risky.
[01:49] <robogoat> Because you want to turn them on to make sure they're good,
[01:49] <db48x> on the other hand, paying someone to plug those disks in one at a time to verify that they are working and report that status to the iabak server might not be cheaper :P
[01:49] <db48x> yea
[01:49] <robogoat> Having put a lot of thought into this,
[01:50] <db48x> although I guess testing 5 8TB drives per day wouldn't really take a lot of time
[01:50] <robogoat> I think having disks racked is the only way to go.
[01:50] <db48x> plug them in to a computer in the morning, let the software run as long as it needs to run, then put them back on the shelf at the end of the day
[01:50] <db48x> that would cycle them all once a month
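The "once a month" estimate can be sanity-checked with a small calculation. This sketch assumes 1 PB = 1000 TB, one full pass of testing per drive, and no spare or parity drives; the function and parameter names are hypothetical.

```python
import math

def rotation_days(total_tb, drive_tb, drives_per_day):
    """Return (number of drives, days needed to test every drive once)."""
    drives = math.ceil(total_tb / drive_tb)
    days = math.ceil(drives / drives_per_day)
    return drives, days

# 1 PB on 8 TB drives, testing 5 drives per day.
drives, days = rotation_days(total_tb=1000, drive_tb=8, drives_per_day=5)
print(f"{drives} drives, full cycle every {days} days")
```

That works out to 125 drives and a 25-day cycle, so roughly once a month if drives are tested every day.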
[01:51] <robogoat> But that's still only one petabyte.
[01:51] <robogoat> Which is whatever small fraction of IA
[01:51] <db48x> true, but it's a nice round number
[01:52] <db48x> and SketchCow once told me that it was a good target
[01:52] <robogoat> I don't really disagree with any of those.
[01:52] <db48x> a lot of the 20-something PB that IA has is for the Wayback Machine
[01:52] <db48x> as outsiders, we can't actually download the Wayback Machine's data directly
[01:53] <robogoat> Huh?
[01:53] <robogoat> I think it's possible.
[01:53] <db48x> (it's stored as WARC files in ordinary IA items, but outsiders don't have the permission to download the items directly)
[01:54] <db48x> (we could obviously scrape the Wayback Machine itself, but that would be annoying and time-consuming and dumb
[01:54] <db48x> )
[01:56] <robogoat> Yeah, IDK, been a long time since I looked into it.
[01:58] <db48x> I guess if I had $4000/month to spare, or as income coming in, then I'd take on the responsibility to rotate the disks every day
[01:58] <db48x> it doesn't really make for a compelling story of distributed backups though :)
[01:59] <db48x> well, I require some comestibles
[01:59] <db48x> back after a while
[02:19] *** antomati_ has joined #internetarchive.bak
[02:19] *** ivan` has joined #internetarchive.bak
[02:19] *** antomatic has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:19] *** ivan has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:19] *** trs80 has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:19] *** pikhq has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:19] *** Frogging has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:19] *** CyberJaco has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:19] *** joepie91 has quit IRC (hub.efnet.us irc.colosolutions.net)
[02:22] *** pikhq_ has joined #internetarchive.bak
[02:26] *** CyberJac- has joined #internetarchive.bak
[02:26] *** FluffyFox has joined #internetarchive.bak
[02:30] *** joepie91_ has joined #internetarchive.bak
[02:35] *** CyberJac- is now known as CyberJaco
[02:35] *** FluffyFox is now known as Frogging
[05:04] <closure> db48x: for adding another server, the repolist has the server hostname in the uri for each repo, so each server can own its own set of repos. Then disk requirements for the server don't need to be as large
[05:08] <db48x> I know, but then there's no redundancy
[05:09] <db48x> and anyway, weren't we bottlenecked on IOPS when syncing, not actual disk space?
[05:12] <closure> well, probably depends on how generous the server provider is
[05:14] <db48x> 59GB for all the shards on the server; that's not as small as I'd expected
[05:15] <db48x> weird variation, too
[05:15] <db48x> SHARD6 is 16GB, and SHARD24 is 56MB
[05:16] <db48x> age is a good correlate, but SHARD1 is only 509MB :P
[05:16] <SketchCow> One petabyte is a good number.
[05:16] <SketchCow> robogoat: Don't bikeshed
[05:18] <db48x> SketchCow: that's a bit harsh; glacier would genuinely be easier, and it's fun to examine the architectural possibilities, and it's a long way from asking about the color of the paint
[05:21] <SketchCow> Glacier is hugely expensive, it's a trick.
[05:21] <SketchCow> And this was covered over a year and a half ago.
[05:21] <db48x> sure
[05:21] <SketchCow> You're more patient than me.
[05:22] <SketchCow> Reddit says I'm awful to deal with, I don't know how you do it
[05:22] <db48x> mostly by not doing it as often
[05:22] <db48x> and you're not wrong; eventually the signal-to-noise ratio of the community will drop to where nobody is interested anymore
[05:25] <SketchCow> I think that the thing to do with IA.BAK is not jam back to base premises like a lot of good people didn't spend weeks doing so at the beginning.
[05:26] <SketchCow> I'm sorry I've come in so late into this discussion, but I was in other windows.
[05:26] <db48x> then we need better documentation, so that new folks can answer their questions without bothering everyone else
[05:27] <SketchCow> https://www.archiveteam.org/index.php/INTERNETARCHIVE.BAK should get a little more refresh then, I'll see what I can do
[05:35] <robogoat> SketchCow: I'm certainly not advocating for changing the current system whatsoever, and I realize that glacier can be hugely expensive (shit, anything AWS basically).
[05:35] <robogoat> But it does depend on how you use it.
[05:36] <robogoat> SketchCow: I personally just want to know if what has you on edge is primarily regulatory in nature or something else.
[05:36] <robogoat> regulatory being copyright/dmca/etc.
[05:37] <robogoat> I've had 4TB of data in AWS for a few months, and it is totally a trap.
[05:37] <robogoat> Because to get it out would cost me 4 months of storage.
[05:38] <robogoat> Which is a bullet I don't want to bite yet.
[05:38] <robogoat> But I couldn't collect the data anywhere besides aws cause reasons.
[05:41] <SketchCow> I am sure you want to know.
[05:41] <SketchCow> Do not sacrifice resources you do not have for this project.
[05:43] <SketchCow> And AWS is no refuge
[05:46] <SketchCow> But yes, obviously the page needs more documentation now that it's officially 3 years old (!)
[05:53] <db48x> that is pretty surprising
[05:57] <robogoat> IDK, I've pretty much been idling in here waiting for something to happen.
[05:58] <robogoat> I bought some drives from a guy, and he played a dirty trick on me, saying to the effect of "there's a raid array on these drives with interesting stuff on it, but you have no raid controller, enjoy!"
[05:58] <db48x> heh
[05:59] <robogoat> So I haven't put them into service yet because the raid array is not a clearly identifiable thing, and of course it's hard to just wipe drives with "interesting stuff"
[05:59] <robogoat> Maybe time to crank them up.
[06:00] <SketchCow> Well, I have been balancing "pushing us into doing this full-bore" and "this has turned out to be a very hard problem".
[06:00] <SketchCow> But I think we really need to have it flowing and get people coming in.
[06:08] <db48x> closure: have you got a second?
[06:08] <db48x> Mar 12 23:07:40 erebor iabak-cronjob[30552]: git-annex: .git/annex/objects/Xk/Z6/MD5-s234--1033faa7c58b7eb6f5703b3a77779024: setFileMode: permission denied (Operation not permitted)
[06:08] <db48x> what mode is it trying to set?
[11:50] *** atomotic has joined #internetarchive.bak
[12:33] *** atomotic has quit IRC (Quit: atomotic)
[13:45] *** atomotic has joined #internetarchive.bak
[15:00] *** atomotic has quit IRC (Quit: atomotic)
[15:22] *** atomotic has joined #internetarchive.bak
[16:23] *** atomotic has quit IRC (Quit: atomotic)
[16:36] *** Mateon1 has quit IRC (Read error: Operation timed out)
[16:37] *** Mateon1 has joined #internetarchive.bak
[18:48] *** db48x has quit IRC (Remote host closed the connection)
[21:47] *** db48x has joined #internetarchive.bak