#internetarchive.bak 2017-12-22,Fri


Time Nickname Message
00:00 🔗 Jon has quit IRC (Quit: ZNC - http://znc.in)
01:28 🔗 wp494 has quit IRC (Read error: Operation timed out)
01:35 🔗 wp494 has joined #internetarchive.bak
01:46 🔗 wp494_ has joined #internetarchive.bak
01:49 🔗 wp494 has quit IRC (Read error: Operation timed out)
02:25 🔗 wp494_ is now known as wp494
04:24 🔗 wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
04:25 🔗 wp494 has joined #internetarchive.bak
07:18 🔗 Mateon1 has quit IRC (Ping timeout: 260 seconds)
07:18 🔗 Mateon1 has joined #internetarchive.bak
10:36 🔗 Mateon1 has quit IRC (Remote host closed the connection)
10:36 🔗 Mateon1 has joined #internetarchive.bak
17:41 🔗 beardicus has quit IRC (bye)
17:45 🔗 beardicus has joined #internetarchive.bak
21:14 🔗 sep332 has quit IRC (Read error: Operation timed out)
21:22 🔗 ez has joined #internetarchive.bak
21:22 🔗 ez Somebody2: well, not sure if this place is more suitable for what-ifs than -bs, but anyway
21:23 🔗 Somebody2 ez: Eh, it's still more on the topic. :-)
21:23 🔗 Somebody2 And it enables people who don't care about it to ignore it more easily.
21:23 🔗 ez if you have a well-defined space of items, say 0-100, and everyone picks a random item off it, you get near 99% coverage after 5 or so redundant replicas
21:23 🔗 ez this works without any coordination, provided all participants pick uniformly at random (they have no reason not to; they're altruistic after all)
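The ~99% figure checks out under that uniform-pick assumption: with c*N total picks over N items, expected coverage is 1 - (1 - 1/N)^(cN), roughly 1 - e^(-c), so c = 5 gives about 99.3%. A minimal simulation (sizes here are illustrative):

```python
import random

N = 100_000                  # size of the item space (illustrative)
for c in range(1, 7):        # c*N total picks ~ c replicas per item on average
    picked = {random.randrange(N) for _ in range(c * N)}
    print(f"{c}x redundancy -> {len(picked) / N:.1%} coverage")
```

At 5x this prints roughly 99.3% coverage, matching the claim; covering the last percent of items is what gets expensive.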
21:23 🔗 Somebody2 ez: I see.
21:24 🔗 Somebody2 But I'm not convinced that the lack of uptake of IA.BAK is due to the (very low) coordination requirement.
21:24 🔗 ez Somebody2: for a starting point, I'd probably consider the timestamped snapshot
21:24 🔗 ez it's not perfect, but might work reasonably well
21:25 🔗 ez then after some time, do a second snapshot delta from that, and do the same dance
21:25 🔗 Somebody2 Also, if you have no coordination, you have no way to *restore* the backup.
21:25 🔗 Somebody2 At least, not without out-of-band efforts.
21:25 🔗 ez indeed you don't; you might have only a vague estimate of how spotty the global backup is
21:26 🔗 ez in fact, I'd just burn the randomly chosen items straight to drives which are not worth the electricity to keep online
21:26 🔗 ez fill a 250GB drive, pile it in the closet
21:27 🔗 Somebody2 So given that, I'm not sure how much it matters whether the random space is uniform or not.
21:27 🔗 ez Somebody2: the restoration problem is not related to lack of coordination, but to having a high replica count
21:27 🔗 ez you need a high number of replicas anyway
21:27 🔗 Somebody2 Wait, how does having a high replica count affect restoration?
21:28 🔗 ez well, you obviously need to encode with RS 255,255-85
21:28 🔗 Somebody2 What is RS 255,255-85?
21:28 🔗 ez you need any 85 things out of 255 to restore the original 85 things
21:28 🔗 ez in layman's terms, a 1-of-3 ECC code
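To make "any 85 of 255" concrete: Reed-Solomon treats the data as a polynomial and hands out evaluations of it, so any k of the n shares determine it. Real codecs work over GF(256) on bytes and are far faster; the toy below uses a prime field and illustrative names, purely to demonstrate the threshold property:

```python
import random

P = 2**31 - 1  # prime modulus; all arithmetic is in the field GF(P)

def encode(data, n):
    # Treat the k data symbols as coefficients of a degree-(k-1)
    # polynomial and hand out its value at n distinct points.
    return [(x, sum(d * pow(x, i, P) for i, d in enumerate(data)) % P)
            for x in range(1, n + 1)]

def decode(shares, k):
    # Lagrange interpolation: any k distinct shares pin down the
    # polynomial, and hence the original k data symbols.
    shares = shares[:k]
    coeffs = [0] * k
    for j, (xj, yj) in enumerate(shares):
        basis, denom = [1], 1
        for m, (xm, _) in enumerate(shares):
            if m == j:
                continue
            denom = denom * (xj - xm) % P
            nxt = [0] * (len(basis) + 1)
            for i, b in enumerate(basis):   # multiply basis by (x - xm)
                nxt[i] = (nxt[i] - xm * b) % P
                nxt[i + 1] = (nxt[i + 1] + b) % P
            basis = nxt
        scale = yj * pow(denom, P - 2, P) % P  # modular inverse via Fermat
        coeffs = [(c + scale * b) % P for c, b in zip(coeffs, basis)]
    return coeffs

k, n = 85, 255                        # the RS(255, 85) shape from the chat
data = [random.randrange(P) for _ in range(k)]
shares = encode(data, n)
random.shuffle(shares)                # imagine 170 arbitrary shares are lost
assert decode(shares[:k], k) == data  # any 85 survivors reconstruct everything
```

Fewer than 85 survivors, on the other hand, reveal nothing usable, which is the "useless shards" point made below.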
21:28 🔗 Somebody2 OK...
21:28 🔗 ez the numbers are high because you have insane statistical variance
21:29 🔗 Somebody2 But if you don't have a way to contact *any* of the replicas, you can't restore in any case.
21:29 🔗 ez Somebody2: so say, iabak goes bust and now everyone needs to restore
21:29 🔗 ez they open their closets
21:29 🔗 Somebody2 And if you *do* have a way to contact the replicas, that's coordination.
21:29 🔗 ez and as long as at least 1/3 of the total that was backed up survives
21:29 🔗 ez *any* 1/3
21:29 🔗 ez you get the whole archive
21:30 🔗 Somebody2 Ah, so you contact the replicas AFTER THE FACT, when restoration is needed.
21:30 🔗 ez Somebody2: they have useless shards of RS code
21:30 🔗 ez they *need* to coordinate when restoring
21:30 🔗 ez but not when backing up
21:30 🔗 Somebody2 I see. OK, yeah, I can see that being a possible improvement.
21:31 🔗 Somebody2 It does prevent being able to get any idea of the progress of the backup (without calling for restoration), though.
21:31 🔗 ez Somebody2: it is indeed a what-if, as I'm making different motivation assumptions than you are
21:31 🔗 ez you assume people are interested in a SETI@home-style online server/client architecture, which is fine
21:32 🔗 Somebody2 No, one of the explicit goals of the effort was to allow people to store the HDs offline, and plug them in briefly once a month or so.
21:32 🔗 ez hmm, thats neat
21:32 🔗 ez though the spin-ups could still be seen as a lot of bother
21:32 🔗 Somebody2 Yep!
21:33 🔗 ez I mean there's a lot of emphasis on tracking replicas
21:33 🔗 ez which makes no sense
21:33 🔗 Somebody2 But if you don't test backup media regularly, you should assume it's unrecoverable.
21:33 🔗 Somebody2 ez: Oh? Why does tracking replicas make no sense?
21:34 🔗 ez when massive spread is done with erasure coding and you hit a certain average replica count
21:34 🔗 ez whether your drive failed or not doesn't matter as much
21:34 🔗 ez your drive failure only slightly lowers the chance of recovery across the board
21:34 🔗 Somebody2 But you'd still need to report back average replica count in order to get progress reports, though.
21:35 🔗 ez Somebody2: of course this makes the wild assumption that there is sufficient capacity, which there isn't
21:35 🔗 ez for stochastic replication to work reasonably, you'd need to restrict the subset
21:35 🔗 ez but you need to do that anyway to achieve uniform randomness
21:35 🔗 ez (which is currently done with shards in fact)
21:36 🔗 ez Somebody2: yea, it would need some fancy statistics
21:36 🔗 ez like, knowing the split of "people keeping data online vs keeping it in closet"
21:36 🔗 ez not sure how to arrive at that number
21:36 🔗 ez but once you have it, you can infer total numbers
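A toy version of that inference, with made-up numbers (the online/closet split f is exactly the figure the chat says is hard to obtain):

```python
f = 0.2        # assumed fraction of participants who keep their copies online
online = 3.0   # observed average online replicas per shard
print(f"inferred total replicas per shard: {online / f:.1f}")  # -> 15.0
```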
21:39 🔗 Somebody2 ez: I mean, don't you just need a count of "replicas (of any single shard)"?
21:39 🔗 ez Somebody2: in practice, the stochastic domains would each live in their own shard, yeah
21:40 🔗 ez so, a single volunteer would pick a shard, and start picking random items off it
21:40 🔗 ez until, through some vague quorum protocol, it is agreed the shard has sufficient coverage
21:40 🔗 ez then it moves onto another shard
21:41 🔗 ez quorum can be fairly simple proofs of possession. however it doesn't solve the closet problem in a straightforward manner
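The chat doesn't pin down a proof-of-possession protocol; one common shape is a salted challenge over a random byte range, so the answer can't be precomputed from an ordinary file hash. A sketch under that assumption (all names hypothetical):

```python
import hashlib
import os

def make_challenge(size, span=4096):
    # Fresh salt plus a random byte range; a new challenge each round
    # forces the holder to still have the actual bytes on hand.
    offset = int.from_bytes(os.urandom(4), "big") % max(size - span, 1)
    return {"salt": os.urandom(16), "offset": offset, "length": span}

def prove(path, ch):
    # The holder reads the challenged range and hashes it with the salt.
    with open(path, "rb") as f:
        f.seek(ch["offset"])
        chunk = f.read(ch["length"])
    return hashlib.sha256(ch["salt"] + chunk).hexdigest()
```

A verifier holding its own copy (or precomputed answers) runs the same computation and compares digests. The closet problem remains: a drive in a closet can't answer challenges until it's plugged back in.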
21:42 🔗 ez Somebody2: total progress would be metered in terms of shards with an observed sufficient online count (provided we know the closet number)
21:42 🔗 Somebody2 ez: Well, what we have now already does that...
21:42 🔗 ez Somebody2: yep. the closet number can't be figured out easily without a central authority
21:44 🔗 Somebody2 The closet number can't be found out at all.
21:44 🔗 ez under the assumption that most participants are honest about it, it can
21:45 🔗 ez Somebody2: it is my understanding that the current sharding doesn't use erasure coding
21:45 🔗 Somebody2 ez: Not without asking people to plug in the HDs in their closet once a month.
21:45 🔗 ez which worsens the closet situation a lot
21:45 🔗 ez as you have no wiggle room
21:45 🔗 Somebody2 Which is already what we do, I think.
21:45 🔗 Somebody2 The current sharding uses full mirroring, rather than erasure coding, yes (I think).
21:46 🔗 ez Somebody2: one way to figure out the closet number might indeed be a check every 6 months or so
21:46 🔗 Somebody2 But we do have multiple replicas of each shard
21:46 🔗 Somebody2 So there's the wiggle room.
21:46 🔗 Somebody2 I'm not sure how erasure coding would give us more.
21:46 🔗 ez there's also the issue of inefficiency
21:47 🔗 ez Somebody2: you're assuming that a whole shard either disappears or it doesn't
21:47 🔗 ez that's where the inefficiency comes from
21:47 🔗 ez in reality, only fragments of a shard may disappear
21:48 🔗 ez so any system, centralized or not, has to make sure that there are enough fragments in each shard to keep the EC recoverable
21:48 🔗 Somebody2 ez: No, if we have 4 full copies of shard3, say -- and each one loses 15%; as long as all four didn't lose the *same* data, we can still recover all of it.
21:49 🔗 ez Somebody2: yes
21:49 🔗 ez first, the chance of overlap among those 15% losses is quite high
21:49 🔗 ez second, you've lost 15% across the board and already have a high chance of failure
21:50 🔗 ez and that's while using 4x more storage than you need to.
21:50 🔗 ez with 1-of-3 you can lose 66% across the board and still have full recoverability
21:51 🔗 Somebody2 ez: I see.
21:52 🔗 ez (I still like full mirrors for their simplicity, and they do in fact perform better than RS on small sets)
21:52 🔗 ez but RS with aggressive settings like 85 out of 255 works a bit like magic compared to that
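A rough simulation of this exchange's trade-off, with an assumed independent per-fragment loss rate of 50% (sizes and rates are illustrative; note that mirroring here uses 4x storage against RS's 3x):

```python
import random

random.seed(1)
ITEMS = 100_000
LOSS = 0.50   # assumed fraction of fragments lost "across the board"

# 4 full mirrors: an item is gone only if all four copies lost it,
# but over many items some such collisions are nearly certain.
lost = sum(all(random.random() < LOSS for _ in range(4)) for _ in range(ITEMS))
print(f"4x mirror:  {lost} of {ITEMS} items unrecoverable")       # ~6% of items

# RS(255, 85) stripes: a stripe survives as long as any 85 of its 255
# fragments do; at 50% loss the survivor count concentrates near its
# mean of ~127, so whole-stripe failures are vanishingly rare.
stripes = ITEMS // 85
failed = sum(sum(random.random() > LOSS for _ in range(255)) < 85
             for _ in range(stripes))
print(f"RS(255,85): {failed} of {stripes} stripes unrecoverable")  # ~0
```

Push LOSS toward the 66% threshold, though, and the variance ez mentioned bites: the survivor count's mean slides toward 85 and stripe failures stop being rare.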
21:52 🔗 Somebody2 And full mirroring also has the advantage of being transparent to the storage providers
21:53 🔗 Somebody2 So people don't have to hold data they don't want to
21:53 🔗 ez yeah, with RS everyone would have to hold "garbage" they can't recover without the help of a bunch of random folks
21:53 🔗 Somebody2 So that's why I still think the blockers to further progress on ia.bak are easier-to-install clients for more platforms, and promotion.
21:56 🔗 ez Somebody2: it's kind of a moot point anyway, as RS, at big scales, can save perhaps 2x-3x storage compared to mirroring. it's an improvement, but not a vast enough improvement to warrant the complexity and the issues you mention
21:56 🔗 Somebody2 Nods.
21:57 🔗 Somebody2 ez: Are you interested/able to write improvements to our existing clients?
21:57 🔗 ez honestly, I'm quite pessimistic about it
21:57 🔗 ez no way in hell 100PB+ will appear out of thin air
21:58 🔗 ez so I'm more daydreaming about shifting the paradigm way off, which could perhaps work better
21:59 🔗 ez rather than incremental improvements to the current paradigm, which I'm fairly convinced can't be improved on much anymore
22:00 🔗 sep332 has joined #internetarchive.bak
22:05 🔗 sep332 has quit IRC (Read error: Operation timed out)
22:09 🔗 Somebody2 ez: You really think our current client programs can't be improved on?
22:10 🔗 Somebody2 Or do you think they can't be improved on enough to provide 100PB+ out of thin air?
22:10 🔗 ez oh they definitely can, in terms of ux and all, you're entirely right
22:10 🔗 Somebody2 (which I agree with, but I don't think that's a reason not to improve them)
22:10 🔗 Somebody2 So, interested?
22:10 🔗 ez it's just that such an improvement could deliver, say, an order of magnitude or so
22:10 🔗 ez and I'm the black-and-white, all-or-nothing sort of guy
22:11 🔗 ez if it's 0.5% or 5%, it's just awfully far from enough. the avenue of asking for government grants seems far more viable tbh
22:11 🔗 ez but that doesn't warrant much improvement on the client side
22:15 🔗 ez in terms of lobbying, here's an idea: businesses often liquidate hardware that's not worth operating (meaning not worth keeping online). instead of asking for a grant to buy new hardware, get something rolling in the vein of "ecological disposal" of such hardware
22:15 🔗 Somebody2 Nice idea.
22:15 🔗 Somebody2 I sent the email to the Norwegian folks just now. Who knows how it will go, but it's done at least.
22:15 🔗 ez I'm not sure if the logistics involved are worth it though. we're talking behemoth NAS arrays with iSCSI 250GB drives in them
22:19 🔗 ez Somebody2: in any case, if a project specifically targeting hardware much more prone to faults were involved, I'd participate to write a client with RS support
22:19 🔗 ez cause mirroring becomes pretty inadequate with such an architecture
22:20 🔗 Somebody2 What hardware would that be?
22:21 🔗 ez basically old hardware you keep off and power on once a month: bring it all into one bunker, set up infra to do the power-ons and checks. the hardware and electricity costs are negligible; the majority of the cost would be physical labor and rent for the bunker.
22:21 🔗 Somebody2 Please *DO* work on a client to support hardware like that!
22:23 🔗 ez Somebody2: again, I can pinky-promise on the software side, but this is still a huge endeavor meatspace-wise
22:24 🔗 ez basically some operator of the "enterprise scrapyard"
22:24 🔗 ez I'm not even sure such an idea is practical; the hardware is *extremely* inefficient. think a 1-ton rack full of scrap = 10TB
22:25 🔗 ez (that's the worst case tho, in practice it's in the 100-500TB range)
22:28 🔗 ez so basically a shitload of space, with not too much flammable material around, almost for free would be adequate. I can't really think of such a place; basically some sort of warehouse in the middle of nowhere?
22:53 🔗 tuluu has joined #internetarchive.bak
22:56 🔗 tuluu has left
23:16 🔗 Somebody2 ez: Eh, if we have the software, it will make working on getting the hardware more attractive.
23:41 🔗 Senji_ It somewhat distresses me how easily the whole thing could be duplicated if money were just thrown at the problem
23:42 🔗 Senji_ is now known as Senji
23:45 🔗 Senji At work, with our current systems, we could turn 100PB into 6500 m^3 of tapes in 10 years (with a little additional investment we could probably bring that 10 years down to 1 year easily).
23:46 🔗 Senji I don't think we have space to store that many tapes, but ICBW
23:47 🔗 Senji But we'd charge $2.5m a year for that
23:48 🔗 Senji (that's two tape copies; I assume we'd charge about $1.5m for one tape copy)
