00:00 *** Jon has quit IRC (Quit: ZNC - http://znc.in)
01:28 *** wp494 has quit IRC (Read error: Operation timed out)
01:35 *** wp494 has joined #internetarchive.bak
01:46 *** wp494_ has joined #internetarchive.bak
01:49 *** wp494 has quit IRC (Read error: Operation timed out)
02:25 *** wp494_ is now known as wp494
04:24 *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
04:25 *** wp494 has joined #internetarchive.bak
07:18 *** Mateon1 has quit IRC (Ping timeout: 260 seconds)
07:18 *** Mateon1 has joined #internetarchive.bak
10:36 *** Mateon1 has quit IRC (Remote host closed the connection)
10:36 *** Mateon1 has joined #internetarchive.bak
17:41 *** beardicus has quit IRC (bye)
17:45 *** beardicus has joined #internetarchive.bak
21:14 *** sep332 has quit IRC (Read error: Operation timed out)
21:22 *** ez has joined #internetarchive.bak
21:22 <ez> Somebody2: well, not sure if this place is more suitable for what-ifs than -bs, but anyway
21:23 <Somebody2> ez: Eh, it's still more on topic. :-)
21:23 <Somebody2> And it enables people who don't care about it to ignore it more easily.
21:23 <ez> if you have a well-defined space of items, say 0-100, and everyone picks a random item off it, you get near 99% coverage after 5 or so redundant replicas
21:23 <ez> this works without any coordination, provided all participants pick uniformly at random (they have no reason not to; they're altruistic, after all)
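ez's coverage claim is easy to check numerically. A minimal sketch (the function name and parameters are illustrative, not from the log): with N items and R·N total uniformly random picks, the chance a given item is never picked is (1 - 1/N)^(R·N) ≈ e^-R, so R = 5 gives roughly 99.3% expected coverage.

```python
import math
import random

def coverage(n_items=100, redundancy=5, trials=2000):
    """Average fraction of items covered when redundancy * n_items
    picks are made uniformly at random, with no coordination."""
    covered = 0
    for _ in range(trials):
        picked = {random.randrange(n_items) for _ in range(redundancy * n_items)}
        covered += len(picked)
    return covered / (trials * n_items)

# Analytic approximation: P(item missed) = (1 - 1/N)^(R*N) ~ e^-R
print(round(1 - math.exp(-5), 4))  # 0.9933
print(round(coverage(), 4))        # close to the analytic value
```

The simulation should land very near 1 - e^-5, matching the "near 99% after 5 or so replicas" figure quoted in the discussion.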
21:23 <Somebody2> ez: I see.
21:24 <Somebody2> But I'm not convinced that the lack of uptake of IA.BAK is due to the (very low) coordination requirement.
21:24 <ez> Somebody2: for a starting point, i'd probably consider the timestamped snapshot
21:24 <ez> it's not perfect, but might work reasonably well
21:25 <ez> then after some time, do a second snapshot delta from that, and do the same dance
21:25 <Somebody2> Also, if you have no coordination, you have no way to *restore* the backup.
21:25 <Somebody2> At least, not without out-of-band efforts.
21:25 <ez> indeed you don't; you might have only a vague estimate of how spotty the global backup is
21:26 <ez> in fact, i'd just straight-up burn the randomly chosen items to drives which are not worth the electricity to keep online
21:26 <ez> fill a 250GB drive, pile it in the closet
21:27 <Somebody2> So given that, I'm not sure how much it matters whether the random space is uniform or not.
21:27 <ez> Somebody2: the restoration problem is not related to lack of coordination, but to having a high replica count
21:27 <ez> you need a high number of replicas anyway
21:27 <Somebody2> Wait, how does having a high replica count affect restoration?
21:28 <ez> well, you obviously need to encode with RS 255,255-85
21:28 <Somebody2> What is RS 255,255-85?
21:28 <ez> you need any 85 things out of 255 to restore the original 85 things
21:28 <ez> in layman's terms, a 1-of-3 ECC code
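The Reed-Solomon setup ez describes — 85 data symbols expanded into 255 fragments, any 85 of which suffice to rebuild the originals — turns recovery into a binomial tail. A hedged sketch of the survival math (function name and the example survival probabilities are illustrative, not from the log):

```python
from math import comb

def recovery_prob(n=255, k=85, survive=0.5):
    """P(recovery) for an (n, k) erasure code where each of the n
    fragments independently survives with probability `survive`:
    recovery succeeds iff at least k fragments survive."""
    return sum(comb(n, j) * survive**j * (1 - survive)**(n - j)
               for j in range(k, n + 1))

# Losing half of all fragments still recovers almost surely,
# because the threshold is only 85/255 = 1/3 of the fragments:
print(recovery_prob(survive=0.5) > 0.999)   # True
# Well below the 1/3 threshold, recovery collapses:
print(recovery_prob(survive=0.25) < 0.01)   # True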
21:28 <Somebody2> OK...
21:29 <ez> the numbers are high because you have insane statistical variance
21:29 <Somebody2> But if you don't have a way to contact *any* of the replicas, you can't restore in any case.
21:29 <ez> Somebody2: so say iabak goes bust and now everyone needs to restore
21:29 <ez> they open their closets
21:29 <Somebody2> And if you *do* have a way to contact the replicas, that's coordination.
21:29 <ez> and as long as at least 1/3 of the total of what was backed up survives
21:29 <ez> *any* 1/3
21:29 <ez> you get the whole archive
21:30 <Somebody2> Ah, so you contact the replicas AFTER THE FACT, when restoration is needed.
21:30 <ez> Somebody2: they have useless shards of RS code
21:30 <ez> they *need* to coordinate when restoring
21:30 <ez> but not when backing up
21:30 <Somebody2> I see. OK, yeah, I can see that being a possible improvement.
21:31 <Somebody2> It does prevent being able to get any idea of the progress of the backup (without calling for restoration), though.
21:31 <ez> Somebody2: it is indeed a what-if, as i'm making different motivation assumptions than you have
21:31 <ez> you assume people are interested in a seti-at-home online server/client architecture, which is fine
21:32 <Somebody2> No, one of the explicit goals of the effort was to allow people to store the HDs offline, and plug them in briefly once a month or so.
21:32 <ez> hmm, that's neat
21:32 <ez> though the spin-ups could still be seen as a lot of bother
21:33 <Somebody2> Yep!
21:33 <ez> i mean there's a lot of emphasis on tracking replicas
21:33 <ez> which makes no sense
21:33 <Somebody2> But if you don't test backup media regularly, you should assume it's unrecoverable.
21:34 <Somebody2> ez: Oh? Why does tracking replicas make no sense?
21:34 <ez> when the massive spread is done with erasure coding and you hit a certain average replica count
21:34 <ez> whether your drive failed or not doesn't matter as much
21:34 <ez> your drive failure slightly lowered the chance of recovery across a wide board
21:35 <Somebody2> But you'd still need to report back the average replica count in order to get progress reports, though.
21:35 <ez> Somebody2: of course this makes wild assumptions that there is sufficient capacity, which there isn't
21:35 <ez> for stochastic replication to work reasonably, you'd need to restrict the subset
21:35 <ez> but you need to do that anyway to achieve uniform randomness
21:36 <ez> (which is currently done with shards, in fact)
21:36 <ez> Somebody2: yea, it would need some fancy statistics
21:36 <ez> like knowing the split of "people keeping data online vs keeping it in the closet"
21:36 <ez> not sure how to arrive at that number
21:36 <ez> but once you have it, you can infer total numbers
21:39 <Somebody2> ez: I mean, don't you just need a count of "replicas (of any single shard)"?
21:39 <ez> Somebody2: in practice, the stochastic domains would each live in their own shard, yea
21:40 <ez> so a single volunteer would pick a shard, and start picking random items off it
21:40 <ez> until, through some vague quorum protocol, it is agreed the shard is sufficient
21:40 <ez> then it moves on to another shard
21:41 <ez> quorum can be fairly simple proofs of possession. however, it doesn't solve the closet problem in a straightforward manner
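A "fairly simple proof of possession" can be sketched as a challenge-response hash: the verifier sends a fresh random nonce, and the prover must hash the nonce together with the item's bytes, which a precomputed digest of the item alone cannot answer. This is an illustrative sketch (all names invented), not anything from IA.BAK — and note it requires the verifier to hold its own copy, which is exactly why it doesn't straightforwardly cover drives sitting offline in a closet:

```python
import hashlib
import os

def prove_possession(item_bytes: bytes, nonce: bytes) -> str:
    """Prover side: demonstrate possession of item_bytes for this nonce."""
    return hashlib.sha256(nonce + item_bytes).hexdigest()

def verify_possession(item_bytes: bytes, nonce: bytes, proof: str) -> bool:
    """Verifier side: recompute the same hash from its own copy."""
    return prove_possession(item_bytes, nonce) == proof

nonce = os.urandom(16)
item = b"example archive item"
proof = prove_possession(item, nonce)
print(verify_possession(item, nonce, proof))           # True
print(verify_possession(b"other item", nonce, proof))  # False
```

Because the nonce changes every challenge, a participant cannot pass by storing only a hash and discarding the data.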
21:42 <ez> Somebody2: total progress would be metered in terms of shards with an observed sufficient online count (provided we know the closet number)
21:42 <Somebody2> ez: Well, what we have now already does that...
21:42 <ez> Somebody2: yep. the closet number can't be figured out easily without a central authority
21:44 <Somebody2> The closet number can't be found out at all.
21:44 <ez> under the assumption most participants are honest about it, it can
21:45 <ez> Somebody2: it is my understanding current sharding doesn't use erasure coding
21:45 <Somebody2> ez: Not without asking people to plug in the HDs in their closet once a month.
21:45 <ez> which worsens the closet situation a lot
21:45 <ez> as you have no wiggle room
21:45 <Somebody2> Which is already what we do, I think.
21:45 <Somebody2> The current sharding uses full mirroring, rather than erasure coding, yes (I think).
21:46 <ez> Somebody2: one way to figure out the closet number might indeed be a check every 6 months or so
21:46 <Somebody2> But we do have multiple replicas of each shard
21:46 <Somebody2> So there's the wiggle room.
21:46 <Somebody2> I'm not sure how erasure coding would give us more.
21:47 <ez> there's also the issue of inefficiency
21:47 <ez> Somebody2: you're assuming that either the whole shard disappears or it doesn't
21:47 <ez> that's where the inefficiency comes from
21:48 <ez> in reality, only fragments of a shard may disappear
21:48 <ez> so any system, centralized or not, has to make sure that there are enough fragments in each shard to make the EC recoverable
21:48 <Somebody2> ez: No, if we have 4 full copies of shard3, say -- and each one loses 15%, then as long as all four didn't lose the *same* data, we can still recover all of it.
21:49 <ez> Somebody2: yes
21:49 <ez> first, the chance of a 15% overlap is quite high
21:50 <ez> second, you lost 15% across the board and already have a high chance of failure
21:50 <ez> and that's when using 4x more than you need to.
21:50 <ez> with 1-of-3 you can lose 66% across the board, and still have full recoverability
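ez's point about overlapping losses checks out with a quick Monte Carlo (a hedged sketch; the block count and loss rate are made up for illustration). With 4 independent mirrors each losing a random 15% of its blocks, any given block is lost from all four with probability 0.15^4 ≈ 5×10⁻⁴ — so a shard of 10,000 blocks almost always has at least one block that no copy retains:

```python
import random

def all_copies_lose_something(n_blocks=10_000, copies=4, loss=0.15, trials=50):
    """Fraction of trials in which at least one block is missing
    from every one of the full mirrors simultaneously."""
    failures = 0
    for _ in range(trials):
        lost = [set(random.sample(range(n_blocks), int(loss * n_blocks)))
                for _ in range(copies)]
        # Blocks absent from the intersection survive in some copy;
        # a non-empty intersection means unrecoverable data.
        if set.intersection(*lost):
            failures += 1
    return failures / trials
```

Under these assumed numbers the failure fraction comes out near 1, while an RS(255, 85) code at a comparable 3x overhead would tolerate losing two-thirds of all fragments.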
21:51 <Somebody2> ez: I see.
21:52 <ez> (i still like full mirrors for the simplicity of it, and they do in fact perform better than RS on small sets)
21:52 <ez> but RS with aggressive settings like 85 out of 255 works a bit like magic compared to that
21:53 <Somebody2> And full mirroring also has the advantage of being transparent to the storage providers
21:53 <Somebody2> So people don't have to hold data they don't want to
21:53 <ez> yea, with RS everyone would have to hold "garbage" they can't recover without the help of a bunch of random folks
21:56 <Somebody2> So that's why I still think the blocks to further progress on ia.bak are easier-to-install clients for more platforms, and promotion.
21:56 <ez> Somebody2: it's kind of a moot point anyway, as RS, at big scales, can save perhaps 2x-3x storage compared to mirroring. it's an improvement, but not a vast enough improvement to warrant the complexity and the issues you mention
21:57 <Somebody2> Nods.
21:57 <Somebody2> ez: Are you interested/able to write improvements to our existing clients?
21:57 <ez> honestly, i'm quite pessimistic about it
21:57 <ez> no way in hell 100PB+ will appear out of thin air
21:58 <ez> so i'm more daydreaming about shifting the paradigm way off, which could perhaps work better
21:59 <ez> rather than incremental improvements to the current paradigm, which i'm fairly convinced can't be much improved on anymore
22:00 *** sep332 has joined #internetarchive.bak
22:05 *** sep332 has quit IRC (Read error: Operation timed out)
22:09 <Somebody2> ez: You really think our current client programs can't be improved on?
22:10 <Somebody2> Or do you think they can't be improved on enough to provide 100PB+ out of thin air?
22:10 <ez> oh they definitely can, in terms of ux and all, you're entirely right
22:10 <Somebody2> (which I agree with, but I don't think that's a reason not to improve them)
22:10 <Somebody2> So, interested?
22:10 <ez> it's just that such an improvement could deliver, say, a magnitude or so
22:11 <ez> and i am the black-and-white, all-or-nothing sort of guy
22:11 <ez> if it's 0.5% or 5%, it's just awfully not enough. the avenue of asking for government grants for it seems far more viable tbh
22:11 <ez> but that doesn't warrant much improvement on the client side
22:15 <ez> in terms of lobbying, here's an idea: businesses often liquidate hardware not worth operating (meaning: not worth keeping online). instead of asking for a grant to buy new hardware, get something rolling in the vein of "ecological disposal" of such hardware
22:15 <Somebody2> Nice idea.
22:15 <Somebody2> I sent the email to the Norwegian folks just now. Who knows how it will go, but it's done at least.
22:19 <ez> i am not sure if the logistics involved are worth it though. we're talking behemoth NAS arrays with iscsi 250GB drives in them
22:19 <ez> Somebody2: in any case, if a project specifically targeting hardware much more prone to faults were involved, i'd participate to make a client with RS support
22:20 <ez> 'cause mirroring becomes pretty inadequate with such an architecture
22:21 <Somebody2> What hardware would that be?
22:21 <ez> basically old hardware you keep off and power on once a month; bring it all into one bunker, set up infra to do the power-ons and checks. the hardware and electricity costs are negligible; the majority of the cost would be physical labor and rent for the bunker.
22:21 <Somebody2> Please *DO* work on a client to support hardware like that!
22:23 <ez> Somebody2: again, i can pinky-promise on the software side, but this is still a huge endeavor meatspace-wise
22:24 <ez> basically some operator of the "enterprise scrapyard"
22:24 <ez> i am not even sure such an idea is practical, the hardware is *extremely* inefficient. think a 1-ton rack full of scrap = 10tb
22:25 <ez> (that's the worst case tho, in practice it's in the 100-500tb range)
22:28 <ez> so basically a shitload of space, with not too much flammable material around, almost for free, would be adequate. i can't really think of such a place; basically some sort of warehouse in the middle of nowhere?
22:53 *** tuluu has joined #internetarchive.bak
22:56 *** tuluu has left
23:16 <Somebody2> ez: Eh, if we have the software, it will make working on getting the hardware more attractive.
23:41 <Senji_> It somewhat distresses me how easily the whole thing could be duplicated if money were just thrown at the problem
23:42 *** Senji_ is now known as Senji
23:45 <Senji> At work, with our current systems, we could turn 100PB into 6500 m^3 of tapes in 10 years (with a little additional investment we could probably bring that 10 years down to 1 year easily).
23:46 <Senji> I don't think we have space to store that many tapes, but ICBW
23:47 <Senji> But we'd charge $2.5m a year for that
23:48 <Senji> (that's two tape copies; I assume we'd charge about $1.5m for one tape copy)