#internetarchive.bak 2018-03-13,Tue


Time Nickname Message
00:50 🔗 db48x the other thing we should do is decide on what sort of indirection we should use to split things off onto multiple servers, stand up another VM and split the load between them
01:10 🔗 db48x identical servers with round-robin dns would be simple, and I can't think of any reason why it wouldn't be effective
01:10 🔗 robogoat I can try to bring some more storage online.
01:11 🔗 robogoat But aren't there mirrorings of the archive in places?
01:11 🔗 robogoat Some universities, multiple countries?
01:12 🔗 db48x IA has its own redundancies, yes
01:13 🔗 db48x some universities probably mirror specific collections which they helped create or curate
01:14 🔗 robogoat So is the concern more about archiveteam stuff?
01:16 🔗 db48x that's part of it
01:16 🔗 db48x archiveteam relies quite a lot on being able to upload things to IA
01:17 🔗 db48x none of us have the amount of cash it would take to duplicate that infrastructure
01:18 🔗 robogoat Yeah, we need an archiveteam lottery winner :p
01:18 🔗 db48x ah, that's a good idea
01:29 🔗 robogoat So it's not that there's likely an existential threat to the archive, but some specific datasets?
01:30 🔗 db48x there's no known existential threat to the Archive, or to any specific items or collections stored by the Archive
01:31 🔗 db48x I mean, aside from copyright
01:31 🔗 db48x of course, I'm not really an insider
01:31 🔗 robogoat < SketchCow> I think we're a target (IA) again
01:32 🔗 robogoat < SketchCow> I think they've gotten a hang of things, and we'll be gone after
01:32 🔗 db48x actually, I was just thinking about the Amiga collection on IA
01:32 🔗 robogoat Those are some serious words.
01:33 🔗 db48x one person claims to own everything Amiga-related, and a lot of things are dark because of it
01:33 🔗 robogoat Yeah, darking things is complicated.
01:34 🔗 db48x so maybe there are threats to collections, but I think they're all copyright-based
01:35 🔗 db48x and I can't see how that would be worse now than it was last year
01:35 🔗 db48x maybe SketchCow will enlighten us
01:35 🔗 robogoat I have always worried about the IA from a security perspective.
01:36 🔗 db48x how so?
01:37 🔗 robogoat Depending on the motives, determination, and scruples of the adversary, it might be easier to compromise IA and remove the data that way.
01:39 🔗 db48x I suppose it's possible
01:40 🔗 db48x though it seems quite unlikely that an ordinary copyright-holder would go to such lengths, when a DMCA takedown is so easy
01:40 🔗 robogoat If, for example, people felt like the IA were archiving data that had evidence that would counter statements made by an interest.
01:41 🔗 robogoat I guess, why do you think that copyright-holders are the only ones who would have an interest in preventing access to IA information?
01:42 🔗 robogoat We just need Amazon to donate one of these: https://aws.amazon.com/snowmobile/
01:42 🔗 db48x mmm, that's something I hadn't thought of
01:43 🔗 db48x heh, that'd be fun to play with
01:43 🔗 db48x even Glacier is quite expensive, though
01:44 🔗 robogoat ?!?
01:44 🔗 robogoat $0.004 per GB / Month is expensive?
01:44 🔗 robogoat So that's $4/TB/mo.
01:44 🔗 db48x $4000/PB/mo
01:45 🔗 db48x personally, that's pretty expensive
01:45 🔗 robogoat Personally, absolutely.
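The price arithmetic in the exchange above can be checked with a short sketch. It assumes decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) and uses only the $0.004 per GB per month figure quoted in the channel; retrieval and request fees are not modelled, and the function name is made up for illustration.

```python
from decimal import Decimal

# Assumption: decimal storage units, as storage vendors usually bill.
GB_PER_TB = 1000
TB_PER_PB = 1000

# Figure quoted in the channel; not checked against any current price list.
PRICE_PER_GB_MONTH = Decimal("0.004")


def monthly_storage_cost(gigabytes, price_per_gb_month=PRICE_PER_GB_MONTH):
    """Return the monthly storage cost in dollars for the given capacity."""
    return price_per_gb_month * gigabytes


per_tb = monthly_storage_cost(GB_PER_TB)
per_pb = monthly_storage_cost(GB_PER_TB * TB_PER_PB)

print(f"${per_tb:.2f} per TB per month")
print(f"${per_pb:.2f} per PB per month")
```

Decimal is used instead of float so the per-GB price multiplies out exactly.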
01:46 🔗 db48x and we could rent a room somewhere and put hard drives in it for a lot less
01:46 🔗 robogoat Mmmm, interesting. Could you?
01:46 🔗 robogoat Amortized out over time you could.
01:46 🔗 robogoat But by how much?
01:47 🔗 db48x unknown
01:47 🔗 robogoat Why is it unknown?
01:47 🔗 robogoat What variable is missing?
01:47 🔗 db48x because I've never actually sat down and priced everything out
01:48 🔗 robogoat Yeah, I mean, what you're pricing depends on a lot of variables.
01:48 🔗 db48x it seems obvious that a climate-controlled room containing a PB worth of unpowered disks would be cheaper
01:48 🔗 db48x I'm not so sure about a PB of powered up disks, ready to serve all the data
01:49 🔗 robogoat Right, and storing unpowered disks is slightly more risky.
01:49 🔗 robogoat Because you want to turn them on to make sure they're good,
01:49 🔗 db48x on the other hand, paying someone to plug those disks in one at a time to verify that they are working and report that status to the iabak server might not be cheaper :P
01:49 🔗 db48x yea
01:49 🔗 robogoat Having put a lot of thought into this,
01:50 🔗 db48x although I guess testing 5 8TB drives per day wouldn't really take a lot of time
01:50 🔗 robogoat I think having disks racked is the only way to go.
01:50 🔗 db48x plug them in to a computer in the morning, let the software run as long as it needs to run, then put them back on the shelf at the end of the day
01:50 🔗 db48x that would cycle them all once a month
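A rough check of the rotation schedule described above, assuming exactly 1 PB = 1000 TB of data, 8 TB drives filled completely, and no spare or parity drives. The function names are illustrative only.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def drives_needed(total_tb, drive_tb):
    """Number of drives required to hold total_tb, with no redundancy."""
    return ceil_div(total_tb, drive_tb)


def days_per_full_cycle(drive_count, drives_tested_per_day):
    """Days needed to plug in and verify every drive once."""
    return ceil_div(drive_count, drives_tested_per_day)


drive_count = drives_needed(total_tb=1000, drive_tb=8)
cycle_days = days_per_full_cycle(drive_count, drives_tested_per_day=5)

print(drive_count, "drives,", cycle_days, "days per full cycle")
```

Under these assumptions the shelf holds 125 drives, and testing 5 per day touches each one every 25 days, which is consistent with "once a month".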
01:51 🔗 robogoat But that's still only one petabyte.
01:51 🔗 robogoat Which is whatever small fraction of IA
01:51 🔗 db48x true, but it's a nice round number
01:52 🔗 db48x and SketchCow once told me that it was a good target
01:52 🔗 robogoat I don't really disagree with any of those.
01:52 🔗 db48x a lot of the 20-something PB that IA has is for the Wayback Machine
01:52 🔗 db48x as outsiders, we can't actually download the Wayback Machine's data directly
01:53 🔗 robogoat Huh?
01:53 🔗 robogoat I think it's possible.
01:53 🔗 db48x (it's stored as WARC files in ordinary IA items, but outsiders don't have the permission to download the items directly)
01:54 🔗 db48x (we could obviously scrape the Wayback Machine itself, but that would be annoying and time-consuming and dumb)
01:56 🔗 robogoat Yeah, IDK, been a long time since I looked into it.
01:58 🔗 db48x I guess if I had $4000/month to spare, or as income coming in, then I'd take on the responsibility to rotate the disks every day
01:58 🔗 db48x it doesn't really make for a compelling story of distributed backups though :)
01:59 🔗 db48x well, I require some comestibles
01:59 🔗 db48x back after a while
02:19 🔗 antomati_ has joined #internetarchive.bak
02:19 🔗 ivan` has joined #internetarchive.bak
02:19 🔗 antomatic has quit IRC (hub.efnet.us irc.colosolutions.net)
02:19 🔗 ivan has quit IRC (hub.efnet.us irc.colosolutions.net)
02:19 🔗 trs80 has quit IRC (hub.efnet.us irc.colosolutions.net)
02:19 🔗 pikhq has quit IRC (hub.efnet.us irc.colosolutions.net)
02:19 🔗 Frogging has quit IRC (hub.efnet.us irc.colosolutions.net)
02:19 🔗 CyberJaco has quit IRC (hub.efnet.us irc.colosolutions.net)
02:19 🔗 joepie91 has quit IRC (hub.efnet.us irc.colosolutions.net)
02:22 🔗 pikhq_ has joined #internetarchive.bak
02:26 🔗 CyberJac- has joined #internetarchive.bak
02:26 🔗 FluffyFox has joined #internetarchive.bak
02:30 🔗 joepie91_ has joined #internetarchive.bak
02:35 🔗 CyberJac- is now known as CyberJaco
02:35 🔗 FluffyFox is now known as Frogging
05:04 🔗 closure db48x: for adding another server, the repolist has the server hostname in the uri for each repo, so each server can own its own set of repos. Then disk requirements for the server don't need to be as large
05:08 🔗 db48x I know, but then there's no redundancy
05:09 🔗 db48x and anyway, weren't we bottlenecked on IOPS when syncing, not actual disk space?
05:12 🔗 closure well, probably depends on how generous the server provider is
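The per-server ownership idea closure describes could look roughly like the sketch below. The actual repolist format is not shown in the channel, so the "NAME URI" line layout, the example hostnames, and the function name are all hypothetical; only the idea that each repo's URI carries a server hostname comes from the discussion.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical repolist contents: one "NAME URI" pair per line.
SAMPLE_REPOLIST = """\
SHARD1 ssh://server-a.example/~/shard1
SHARD2 ssh://server-a.example/~/shard2
SHARD3 ssh://server-b.example/~/shard3
"""


def repos_by_server(repolist_text):
    """Group repo names by the hostname found in each repo's URI."""
    owners = defaultdict(list)
    for line in repolist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, uri = line.split(None, 1)
        owners[urlparse(uri).hostname].append(name)
    return dict(owners)


servers = repos_by_server(SAMPLE_REPOLIST)
for host, repos in servers.items():
    print(host, "owns", ", ".join(repos))
```

In this layout each shard lives on exactly one host, which is the lack of redundancy db48x objects to; identical servers behind round-robin DNS would instead keep every shard on every server.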
05:14 🔗 db48x 59GB for all the shards on the server; that's not as small as I'd expected
05:15 🔗 db48x weird variation, too
05:15 🔗 db48x SHARD6 is 16GB, and SHARD24 is 56MB
05:16 🔗 db48x age is a good correlate, but SHARD1 is only 509MB :P
05:16 🔗 SketchCow One petabyte is a good number.
05:16 🔗 SketchCow robogoat: Don't bikeshed
05:18 🔗 db48x SketchCow: that's a bit harsh; glacier would genuinely be easier, and it's fun to examine the architectural possibilities, and it's a long way from asking about the color of the paint
05:21 🔗 SketchCow Glacier is hugely expensive, it's a trick.
05:21 🔗 SketchCow And this was covered over a year and a half ago.
05:21 🔗 db48x sure
05:21 🔗 SketchCow You're more patient than me.
05:22 🔗 SketchCow Reddit says I'm awful to deal with, I don't know how you do it
05:22 🔗 db48x mostly by not doing it as often
05:22 🔗 db48x and you're not wrong; eventually the signal-to-noise ratio of the community will drop to where nobody is interested anymore
05:25 🔗 SketchCow I think that the thing to do with IA.BAK is not jam back to base premises like a lot of good people didn't spend weeks doing so at the beginning.
05:26 🔗 SketchCow I'm sorry I've come in so late into this discussion, but I was in other windows.
05:26 🔗 db48x then we need better documentation, so that new folks can answer their questions without bothering everyone else
05:27 🔗 SketchCow https://www.archiveteam.org/index.php/INTERNETARCHIVE.BAK should get a little more refresh then, I'll see what I can do
05:35 🔗 robogoat SketchCow: I'm certainly not advocating for changing the current system whatsoever, and I realize that glacier can be hugely expensive (shit, anything AWS basically).
05:35 🔗 robogoat But it does depend on how you use it.
05:36 🔗 robogoat SketchCow: I personally just want to know if what has you on edge is primarily regulatory in nature or something else.
05:36 🔗 robogoat regulatory being copyright/dmca/etc.
05:37 🔗 robogoat I've had 4TB of data in AWS for a few months, and it is totally a trap.
05:37 🔗 robogoat Because to get it out would cost me 4 months of storage.
05:38 🔗 robogoat Which is a bullet I don't want to bite yet.
05:38 🔗 robogoat But I couldn't collect the data anywhere besides aws cause reasons.
05:41 🔗 SketchCow I am sure you want to know.
05:41 🔗 SketchCow Do not sacrifice resources you do not have for this project.
05:43 🔗 SketchCow And AWS is no refuge
05:46 🔗 SketchCow But yes, obviously the page needs more documentation now that it's officially 3 years old (!)
05:53 🔗 db48x that is pretty surprising
05:57 🔗 robogoat IDK, I've pretty much been idling in here waiting for something to happen.
05:58 🔗 robogoat I bought some drives from a guy, and he played a dirty trick on me, saying to the effect of "there's a raid array on these drives with interesting stuff on it, but you have no raid controller, enjoy!"
05:58 🔗 db48x heh
05:59 🔗 robogoat So I haven't put them into service yet because the raid array is no clearly identifiable thing, and of course it's hard to just wipe drives with "interesting stuff"
05:59 🔗 robogoat Maybe time to crank them up.
06:00 🔗 SketchCow Well, I have been balancing "pushing us into doing this full-bore" and "this has turned out to be a very hard problem".
06:00 🔗 SketchCow But I think we really need to have it flowing and get people coming in.
06:08 🔗 db48x closure: have you got a second?
06:08 🔗 db48x Mar 12 23:07:40 erebor iabak-cronjob[30552]: git-annex: .git/annex/objects/Xk/Z6/MD5-s234--1033faa7c58b7eb6f5703b3a77779024: setFileMode: permission denied (Operation not permitted)
06:08 🔗 db48x what mode is it trying to set?
11:50 🔗 atomotic has joined #internetarchive.bak
12:33 🔗 atomotic has quit IRC (Quit: atomotic)
13:45 🔗 atomotic has joined #internetarchive.bak
15:00 🔗 atomotic has quit IRC (Quit: atomotic)
15:22 🔗 atomotic has joined #internetarchive.bak
16:23 🔗 atomotic has quit IRC (Quit: atomotic)
16:36 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
16:37 🔗 Mateon1 has joined #internetarchive.bak
18:48 🔗 db48x has quit IRC (Remote host closed the connection)
21:47 🔗 db48x has joined #internetarchive.bak
