Time | Nickname | Message
00:08 | | Madbrad has joined #archiveteam-ot
01:13 | | jspiros__ has quit IRC ()
01:14 | | jspiros__ has joined #archiveteam-ot
01:15 | | jspiros__ has quit IRC (Client Quit)
01:16 | | jspiros__ has joined #archiveteam-ot
01:22 | | killsushi has quit IRC (Quit: Leaving)
02:12 | | ayanami_ has joined #archiveteam-ot
02:21 | | BlueMax has quit IRC (Quit: Leaving)
02:49 | | Odd0002_ has joined #archiveteam-ot
02:50 | | Odd0002 has quit IRC (Read error: Operation timed out)
02:50 | | Odd0002_ is now known as Odd0002
02:54 | | bytefray has quit IRC (Read error: Connection reset by peer)
02:57 | | bytefray has joined #archiveteam-ot
02:58 | | paul2520 has quit IRC (Read error: Operation timed out)
03:04 | | dxrt_ has quit IRC (Read error: Connection reset by peer)
03:07 | | BlueMax has joined #archiveteam-ot
03:07 | | qw3rty113 has quit IRC (Ping timeout: 600 seconds)
03:08 | | paul2520 has joined #archiveteam-ot
03:08 | | step has quit IRC (Ping timeout: 600 seconds)
03:09 | | kiska1 has quit IRC (Ping timeout: 600 seconds)
03:10 | | qw3rty113 has joined #archiveteam-ot
03:11 | | step has joined #archiveteam-ot
03:12 | | kiska1 has joined #archiveteam-ot
03:12 | | Fusl sets mode: +o kiska1
03:12 | | qw3rty114 has joined #archiveteam-ot
03:13 | | dxrt_ has joined #archiveteam-ot
03:15 | | killsushi has joined #archiveteam-ot
03:20 | | qw3rty113 has quit IRC (Read error: Operation timed out)
03:21 | | qw3rty115 has joined #archiveteam-ot
03:25 | | step has quit IRC (Remote host closed the connection)
03:27 | | qw3rty114 has quit IRC (Read error: Operation timed out)
03:35 | | odemg has quit IRC (Ping timeout: 615 seconds)
03:42 | | odemg has joined #archiveteam-ot
04:08 | | ayanami_ has quit IRC (Quit: Leaving)
05:22 | t3 | ivan: Oh. Okay. Thanks.
05:23 | t3 | Kaz: Where did you get that graph?
05:43 | | dhyan_nat has joined #archiveteam-ot
06:06 | | JAA has quit IRC (Read error: Operation timed out)
06:06 | | cfarquhar has quit IRC (Read error: Operation timed out)
06:07 | | svchfoo1 has quit IRC (Read error: Operation timed out)
06:08 | | simon816 has quit IRC (Read error: Operation timed out)
06:12 | | lunik1 has quit IRC (Read error: Operation timed out)
07:08 | | cfarquhar has joined #archiveteam-ot
07:08 | | svchfoo1 has joined #archiveteam-ot
07:08 | | simon816 has joined #archiveteam-ot
07:08 | | lunik1 has joined #archiveteam-ot
07:10 | | JAA has joined #archiveteam-ot
07:11 | | Fusl sets mode: +o JAA
07:11 | | bakJAA sets mode: +o JAA
07:34 | | killsushi has quit IRC (Quit: Leaving)
07:47 | VoynichCr | https://www.presstv.com/Detail/2019/04/19/593779/Google-Youtube-presstv-hispantv-channel-close
08:11 | | Atom__ has joined #archiveteam-ot
08:18 | | Atom-- has quit IRC (Read error: Operation timed out)
08:41 | | antiufo has joined #archiveteam-ot
09:25 | Fusl | t3: archive team grafana https://atdash.meo.ws/, the graph showing upload speed into IA from one of Kaz host
09:25 | Fusl | sorry if grammar wrong just woke up
10:24 | | VerifiedJ has joined #archiveteam-ot
10:24 | | Verified_ has quit IRC (Ping timeout: 252 seconds)
10:25 | | Verified_ has joined #archiveteam-ot
10:26 | | antiufo has quit IRC (Quit: WeeChat 2.3)
10:28 | | VerifiedJ has quit IRC (Ping timeout: 252 seconds)
11:01 | | Oddly has quit IRC (Ping timeout: 360 seconds)
11:12 | | bytefray has quit IRC (WeeChat 2.3)
11:17 | | Verified_ has quit IRC (Ping timeout: 252 seconds)
11:18 | | BlueMax has quit IRC (Read error: Connection reset by peer)
11:54 | | dhyan_nat has quit IRC (Read error: Operation timed out)
11:59 | | Verified_ has joined #archiveteam-ot
12:08 | | Tsuser has quit IRC (Ping timeout: 260 seconds)
12:09 | | benjins has joined #archiveteam-ot
13:20 | | Kenshin has joined #archiveteam-ot
13:20 | | Fusl sets mode: +o Kenshin
13:20 | Kenshin | Fusl: my guys are all traditional dedi/vps guys, no experience with openstack or ceph
13:21 | Fusl | so the problem with ceph is, you dont really want to run with anything less than 3-5 nodes, as it will cause more performance bottlenecks than a standalone per-node ZFS setup does
13:21 | Kenshin | we're looking at 3 nodes per "cluster" of sorts?
13:21 | | jspiros__ has quit IRC ()
13:21 | Kenshin | 3 copies of data
13:21 | Kenshin | but trying to figure out what kind of network backbone
13:21 | Kenshin | i tried out onapp storage for a bit in the past, hated it
13:22 | Fusl | running quad or dual 10gbit is what i would recommend
13:22 | Fusl | per node that is
13:22 | Kenshin | the other question is whether we need that kind of speed
13:22 | Fusl | doing dual 100gbit is what i do at home and it didnt increase the performance by a lot
13:23 | Kenshin | the physical servers we're using are 8 bay E5 single cores
13:23 | Fusl | well, you definitely do not want to go gigabit
13:23 | Kenshin | *single processors
13:24 | Kenshin | unless we do pure SSD, which is unlikely
13:24 | Fusl | hard drives? multiply the number of hard drives by 1gbit and you get the required network speed to run a stable cluster
13:24 | Kenshin | probably won't saturate a 2x10G
13:24 | Kenshin | we're thinking of ssd+hdd mix per server
13:24 | Kenshin | 2/6 or 4/4
13:25 | Fusl | for ssd's it's more like 4gbit per ssd
13:25 | Fusl | at least for sata
13:25 | Kenshin | more likely 4/4, high capacity ssd + high cap hdd
13:25 | Fusl | so 20gbit
13:25 | Kenshin | most of our customers are still traditional cpanel hosting or ecommerce
13:25 | Fusl | you can run dual 10gbit on that
13:26 | Kenshin | link bundle? or two vlans
13:26 | Fusl | and just let lacp layer 2+3+4 load balancing do the trick for you
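[Editor's note: a minimal sketch of the dual-10GbE LACP bond described above, in Debian/Proxmox ifupdown syntax; interface names and the address are placeholders, not from the log. Linux bonding exposes layer2+3/layer3+4 hash policies, so layer3+4 is used here as the closest match to the "layer 2+3+4" hashing mentioned.]

    auto bond0
    iface bond0 inet static
        address 10.10.10.11/24            # storage-network address (example)
        bond-slaves enp65s0f0 enp65s0f1   # the two 10GbE ports (hypothetical names)
        bond-mode 802.3ad                 # LACP
        bond-xmit-hash-policy layer3+4    # hash on IP+port so flows spread across both links
        bond-miimon 100                   # link monitoring interval in ms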
13:26 | Kenshin | ok that sounds good
13:26 | Kenshin | compute nodes should also have 2x10G towards storage network right?
13:26 | Fusl | yes
13:27 | Fusl | just as a future note if you ever end up like i did
13:27 | Kenshin | then + 2x10G public internet facing
13:27 | Fusl | if you ever run a ceph cluster with more than around 10k nodes
13:27 | Fusl | split them up into a separate vlan/network and run a secondary cluster
13:27 | Kenshin | the heck? 10k?
13:28 | Kenshin | amazon?
13:28 | Fusl | nah, some private project i've been doing with a friend
13:28 | Fusl | single OSD per ceph host
13:28 | Fusl | ethernet/ceph drives
13:28 | Kenshin | so 10 drives in a server = 10 ceph nodes?
13:28 | Fusl | https://ceph.com/geen-categorie/500-osd-ceph-cluster/
13:29 | Fusl | 10 drives in a server = 10 OSDs in 1 host
13:29 | Fusl | each drive is called an OSD
13:30 | Kenshin | ic
13:30 | Kenshin | so don't overdo the ceph nodes
13:30 | Kenshin | gotcha
13:31 | Kenshin | it's a relatively small setup
13:31 | Kenshin | i got plenty of E3 microclouds (2bay), E5 single or dual procs with 8 bays
13:32 | Fusl | whats the GHz on those cpus?
13:32 | Kenshin | so plan is to convert some of these dedis into a proxmox + ceph cluster. probably just 2-3 racks worth at most
13:33 | Kenshin | E3-1230V3 or V5, so 3.4Ghz x4, E5 2620 V3/V4
13:33 | Fusl | if you're running standalone ceph clusters segregated from the proxmox clusters, disable hyperthreading, vtx and vtd, that will give you at least 30% performance increase, at least the hyperthreading part
13:33 | Kenshin | yeah separated, we have some units that are only single proc so 8 cores, planning to reuse them for pure ceph storage
13:34 | Fusl | yeah that sounds good
13:34 | Kenshin | the dual procs will be used for compute, as well as E3s for high Ghz compute
13:34 | Fusl | how much memory does each node have?
13:34 | Kenshin | E3s are stuck at 32G or 64G max, depending on DDR3/4
13:34 | Fusl | you'll see ceph eat around 2gb of memory per HDD OSD and 4gb per SSD OSD
13:35 | Kenshin | oh ceph, hmm
13:35 | Kenshin | if we did 4x 2TB SSD + 4x 10TB HDD
13:35 | Fusl | you can cut that down to around 1.5GB per OSD tho
13:35 | Kenshin | what are we looking at?
13:35 | Fusl | around 28ish gb memory usage
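[Editor's note: rough arithmetic behind that figure, plus the BlueStore knob that caps per-OSD usage on recent releases; the values are illustrative.]

    # 4 HDD OSDs x ~2 GB + 4 SSD OSDs x ~4 GB ~= 24 GB, plus mon/OS overhead -> ~28 GB
    # cap each OSD daemon at roughly 1.5 GiB, as suggested above:
    ceph config set osd osd_memory_target 1610612736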
13:35 | Kenshin | so 32G should be safe
13:35 | Fusl | yep
13:36 | Kenshin | ok cool, thanks
13:36 | Fusl | and then, you'll see yourself play around with `rbd cache` on the proxmox-side ceph.conf a lot
13:36 | Kenshin | noob question but, how does scaling work?
13:36 | Kenshin | add 3 more ceph nodes when we need more space?
13:37 | Fusl | yeah, adding more OSDs
13:37 | Fusl | they dont even have to be the same size
13:37 | Fusl | thats the good thing about ceph, it will technically eat everything that you throw at it
13:37 | Kenshin | what about balancing?
13:38 | Fusl | it will automatically balance all objects around so they are equally distributed based on the size of the drives
13:38 | Fusl | proxmox-side ceph.conf rbd stuff: http://xor.meo.ws/BgPBAf5FZztBkJPKrG5pMQ60hXRlVEYs.txt
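[Editor's note: the pasted file is not reproduced here; below is a minimal sketch of the kind of client-side `rbd cache` settings being referred to, with illustrative values rather than Fusl's actual ones.]

    # /etc/ceph/ceph.conf on the Proxmox (client) side
    [client]
    rbd cache = true
    rbd cache size = 67108864                  # 64 MiB writeback cache per image
    rbd cache max dirty = 50331648             # upper bound on dirty data before writes block
    rbd cache target dirty = 33554432          # start flushing at ~32 MiB dirty
    rbd cache writethrough until flush = true  # stay in writethrough until the guest issues a flush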
13:38 | Fusl | as for your SSD/HDD mix
13:38 | Kenshin | so assuming 3 hosts, 4 ssd, using ssd storage only. when we spin up an instance does it only use 1 drive per host?
13:39 | Fusl | throw both into the same pool, don't run pool caching
13:39 | Fusl | then go ahead and enable osd primary affinity on all ceph.conf ends
13:39 | Fusl | then set HDD primary affinity to 0
13:39 | Fusl | this will cause all OSD ready to happen from the SSDs rather than from the HDDs
13:39 | Fusl | reads*
13:40 | Fusl | and it will make your SSDs the primary OSD for all your objects
13:40 | Fusl | You must enable "mon osd allow primary affinity = true" on the mons before you can adjust primary-affinity. note that older clients will no longer be able to communicate with the cluster.
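[Editor's note: a sketch of the primary-affinity setup described above; the OSD IDs are placeholders for the HDD-backed OSDs.]

    # ceph.conf on the mons (the option quoted above):
    [mon]
    mon osd allow primary affinity = true

    # then zero the primary affinity of every HDD OSD so reads are served from the SSD copies:
    ceph osd primary-affinity osd.4 0
    ceph osd primary-affinity osd.5 0    # ...repeat for each HDD OSD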
13:41 | Fusl | > when we spin up an instance does it only use 1 drive per host?
13:41 | Fusl | can you elaborate on that question?
13:42 | Kenshin | does it "raid0" across all available OSDs on the host
13:42 | Kenshin | sorry my ceph knowledge is very minimal
13:42 | Fusl | that depends on how you configure it
13:42 | Fusl | so a normal, sane setup would be to set the RBD block size to 4MB and the replication size to 3
13:42 | Kenshin | my thinking is that it sounds like RAID1 over 3 physical nodes
13:42 | Fusl | that will cause all your blocks to be written three times to three different OSDs
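[Editor's note: a minimal sketch of that "sane setup"; pool and image names are placeholders.]

    ceph osd pool set vm-pool size 3          # three copies of every object
    ceph osd pool set vm-pool min_size 2      # keep serving I/O while one copy is missing
    rbd create vm-pool/vm-100-disk-0 --size 100G --object-size 4M   # 4 MiB objects (also the default)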
13:42
π
|
Kenshin |
but whether there's RAID0 within the host, no idea |
13:43
π
|
Fusl |
so that RBD 1MB block size i mentioned earlier |
13:43
π
|
Fusl |
is essentially the size that your RBD will be sliced up into chunks |
13:43
π
|
Fusl |
or "objects" |
13:44
π
|
Fusl |
because thats what they are in ceph |
13:44
π
|
Fusl |
"objects" |
13:44
π
|
Fusl |
so you have a 1MB object, that object lives distributed across three different OSDs, each on its own host |
13:44
π
|
Fusl |
if you configure it correctly, ceph will ensure that no more than one copy of the same block will live on the same host |
13:44
π
|
Fusl |
but it will live on a random OSD on that host |
13:44
π
|
Fusl |
that's what CRUSH map is for |
13:45
π
|
Kenshin |
ic, that makes sense |
13:45
π
|
Kenshin |
but god if ceph's database is fucked |
13:45
π
|
Kenshin |
the whole thing collapses |
13:45
π
|
Fusl |
it's a code, and ill give you an example shortly, that describes how your objects are distributed in the cluster |
13:45
π
|
Fusl |
there's no "database" |
13:45
π
|
Fusl |
its all just CRUSH |
13:45
π
|
Fusl |
so each host in the ceph cluster, each monitor, each manager, admin, etc. |
13:46
π
|
Fusl |
every client |
13:46
π
|
Fusl |
even the proxmox clients |
13:46
π
|
Fusl |
see the exact same CRUSH map |
13:46
π
|
Fusl |
and that CRUSH map is a hash calculation algorithm that tells the client where it has to store that data and how it distributes that across everything |
13:47
π
|
Kenshin |
and that map is stored somewhere? or dynamically generated? |
13:47
π
|
Fusl |
http://xor.meo.ws/e759Hlzp31ohMQzKGfAl3Rc8Rrq9oBnv.txt example crush map on one of my clusters |
13:47
π
|
Fusl |
this is the CRUSH map ^ |
13:47
π
|
Fusl |
its stored on the monitor servers |
13:47
π
|
Fusl |
you get that map |
13:47
π
|
Fusl |
you modify it |
13:47
π
|
Fusl |
and then you push that map into the cluster (the monitors) again |
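[Editor's note: the get/edit/push cycle being described, using the stock ceph/crushtool commands; file names are arbitrary.]

    ceph osd getcrushmap -o crush.bin        # fetch the compiled map from the monitors
    crushtool -d crush.bin -o crush.txt      # decompile to editable text
    $EDITOR crush.txt                        # adjust buckets/rules
    crushtool -c crush.txt -o crush.new.bin  # recompile
    ceph osd setcrushmap -i crush.new.bin    # push it back into the cluster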
13:47 | Fusl | each client always connects to the monitor servers first
13:47 | Fusl | to figure out what the CRUSH map looks like
13:47 | Fusl | and where the OSDs live, ip address, etc.
13:48 | Fusl | and once thats done, the clients will connect to the OSDs whenever they need to
13:48 | Fusl | and when that crush map changes
13:48 | Fusl | for example when one of your OSDs goes offline
13:49 | Fusl | or an entire host goes offline
13:49 | Fusl | the managers will coordinate generating a temporary crush map that resembles a new map based on your static map but with the down OSDs removed from the calculations
13:49 | Fusl | so that your clients always have a way to put the data somewhere, at least temporarily
13:49 | Fusl | so the monitors are the coordinators of the entire cluster
13:50 | Fusl | run them on SSDs
13:50 | Fusl | but run them on raid1 SSDs
13:50 | Fusl | they dont need to be large
13:50 | Fusl | 16gb is all they need
13:50 | Fusl | but they need to be fast
13:50 | Fusl | because they will do all the magic when something breaks or when you do maintenance
13:50 | Fusl | and they like to live in a consensus
13:50 | Fusl | so always have an odd number of monitors
13:51 | Fusl | 3,5,7,9
13:51 | Fusl | you are technically fine if you run the monitors on the same hosts where the OSDs live but they need to have dedicated SSDs
13:51 | Fusl | anyways, i'm afk for 5 mins, ask away and ill answer them when im back
14:02 | Kaz | sounds easier just to raid0 your production cluster /s
14:02 | Fusl | map /dev/null, easiest
14:02 | Fusl | and very good performance
14:02 | Fusl | unbeatable
14:10 | Kenshin | ok so a bit of reading up done, CRUSH is basically like a map of the entire cluster?
14:10 | Kenshin | or at least where data is being stored
14:11 | Fusl | if you wanna see it as that yes
14:11 | Fusl | The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
14:11 | Fusl | CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster. For a detailed discussion of CRUSH, see CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data
14:11 | Fusl | more on that: http://docs.ceph.com/docs/master/rados/operations/crush-map/
14:12 | Kenshin | right. so that will point the client to the correct nodes+osd right?
14:12 | Fusl | for some things, the ceph documentation is REALLY worth a good read
14:12 | Fusl | yes
14:12 | Kenshin | thus bypassing a proxy and going direct to source
14:12 | Fusl | exactly
14:12 | Kenshin | but then say i have a 1GB data file, and 1MB block size
14:13 | Kenshin | i know it's stored in 3 nodes, across 4 OSDs each
14:13 | Fusl | so
14:13 | Kenshin | but the OSDs also store other data
14:13 | Fusl | see it as that
14:13 | Kenshin | where's the data map stored? on the OSD itself?
14:13 | Fusl | the 1gb volume
14:14 | Fusl | will be sliced up into 1024 equal sized 1mb chunks. objects.
14:14 | Fusl | all objects are distributed into several placement groups
14:14 | Fusl | placement groups are essentially buckets that hold millions of objects
14:15 | Fusl | and there are many placement groups, but there shouldn't be too many placement groups because they dont scale very well
14:15 | Fusl | placement groups are stored on the OSDs
14:16 | Fusl | thats how the stuff is distributed across all OSDs, by placement groups
14:16 | Fusl | all the objects within the same placement group always stay within that placement group
14:16 | Fusl | but the placement group is essentially what you replicate across different OSDs
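[Editor's note: the usual rule of thumb for "not too many placement groups", per the Ceph docs' pg_num guidance; the node counts match the 3-host / 8-OSD setup discussed here and the numbers are only illustrative.]

    # target roughly 100 PGs per OSD: pg_num ~= (OSD count x 100) / replica count, rounded to a power of two
    # e.g. 3 hosts x 8 OSDs = 24 OSDs -> 24 x 100 / 3 = 800 -> ~1024 PGs spread over all pools
    ceph osd pool create vm-pool 256 256 replicated   # one pool's share of that budget (example)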
14:17 | Kenshin | so PG is like partitioning?
14:17 | Fusl | kind of, yes
14:17 | Fusl | so when one OSD goes down because the disk dies, ceph does not look up what objects it needs to move based on the OSD id
14:17 | Fusl | it looks up the placement groups that need to be moved
14:18 | Fusl | and then it just moves all objects within that placement group once
14:18 | Kenshin | self-healing?
14:18 | Fusl | yes
14:18 | Kenshin | this assumes one of the other OSDs on the same hosts has available space?
14:18 | Fusl | correct
14:18 | Kenshin | so PGs won't take up 100% of a volume
14:19 | Fusl | it will automatically shrink your available disk space down to whatever is available after that OSD died
14:19 | Kenshin | something like created as necessary
14:20 | Fusl | lets assume you have a 4 node cluster, each having 8 OSDs, no matter what kind of OSDs
14:20 | Fusl | and you have a placement group thats defined to be stored with size=3 (replicate/store the data 3 times)
14:20 | Fusl | if one node of that cluster goes down, 25% of your data is essentially degraded
14:21 | Fusl | it's still there because for most of the 75% of that data you still have two more copies available
14:21 | Fusl | and thats where ceph will go in and automatically re-replicate all placement groups that have been lost
14:21 | Kenshin | right, in simpler terms, self-healing
14:21 | Fusl | and it will do so by balancing all the 25% of the lost placement groups to the rest of the 3 nodes
14:21 | Fusl | yep
14:21 | Kenshin | then when the node comes back up, auto rebalancing?
14:21 | Fusl | yes
14:22 | Fusl | when the node comes back
14:22 | Fusl | you might have the data still stored there
14:22 | Kenshin | but by assuming the node is empty/dirty
14:22 | Kenshin | overwrite everything?
14:22 | Fusl | it compares the version of the objects in all the placement groups
14:22 | Fusl | and then moves that data back into that node
14:22 | Fusl | by merging the new data onto the old one
14:22 | Fusl | and deleting old objects as necessary
14:22 | Kenshin | oh, nice
14:22 | Kenshin | so less data transferred
14:22 | | jspiros__ has joined #archiveteam-ot
14:22 | Fusl | yep
14:23 | Kenshin | how is data cleaning handled? if an object is deleted, old data is zeroized?
14:24 | Fusl | so up to a specific ceph version with bluestore OSDs, data is only unlinked from the disk and then later overwritten
14:24 | Fusl | newer versions support TRIMming of OSDs so that your data is actually deleted from on-disk
14:25 | Fusl | but as far as ceph disk utilization goes, if you delete objects, you're essentially freeing up the space
14:25 | Kenshin | and this also means that, assuming the client network is able to handle it (2x100G uplink for example), it's actually able to retrieve data from across multiple hosts when reading
14:25 | Fusl | correct
14:25 | Kenshin | since it's technically distributed RAID0 on the storage level
14:25 | Kenshin | assuming balancing is all done right
14:25 | Fusl | correct
14:26 | Fusl | there's just one pitfall
14:26 | Fusl | placement groups are replicated in a kind-of primary/secondary way
14:26 | Fusl | where there is one master OSD per placement group
14:26 | Fusl | and all the others are followers
14:27 | Fusl | so when your client goes ahead and reads an object from a PG, it will always read from the primary OSD
14:27 | Kenshin | but write is x3?
14:27 | Fusl | yep
14:27 | Kenshin | and if for some reason, write fails to 1 of 3?
14:28 | Fusl | if it fails, the OSD will be kicked out of the cluster
14:28 | Fusl | that will trigger a restart of that OSD if possible or completely mark that OSD as down
14:28 | Kenshin | ic
14:28 | Kenshin | so write is expensive, network wise
14:28 | Kenshin | thus client needs to have sufficient bandwidth
14:28 | Fusl | yup
14:28 | Fusl | unless
14:28 | Fusl | erasure-coded pools :P
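[Editor's note: erasure-coded pools trade the 3x write fan-out for parity math; a minimal sketch with an example profile, not a recommendation from the conversation. A k=4/m=2 profile needs at least k+m hosts when the failure domain is host.]

    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host   # ~1.5x raw overhead, survives 2 failures
    ceph osd pool create ecpool 128 128 erasure ec-4-2
    ceph osd pool set ecpool allow_ec_overwrites true                 # required before RBD can write to an EC pool
    rbd create vm-pool/vm-101-disk-0 --size 100G --data-pool ecpool   # image metadata in a replicated pool, data in the EC pool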
14:29 | Kaz | going to chime in with a couple of q's: when master OSD fails, another just takes the master position? and is replication handled by the client, or the hosts?
14:30 | * | Kaz is learning things
14:30 | Fusl | if an OSD fails and happened to be a primary OSD for some PGs, the crush rule defines a new master by math
14:30 | Fusl | so "primary" OSDs aren't really defined anywhere, they are just there by being calculated as such in the CRUSH map
14:31 | Fusl | and if a PG's primary OSD goes down, the CRUSH map's calculations define in what order the secondary OSDs become master
14:31 | Kenshin | so this also means, it's technically not 100% raid0 because reads are not load balanced across all the OSDs, but N/3
14:31 | Fusl | also
14:32 | Fusl | data replication is not handled by the client itself
14:32 | Fusl | its handled by the primary OSD
14:32 | Kaz | as a background task or realtime?
14:32 | Fusl | realtime
14:32 | Kaz | ah ic
14:32 | Kaz | ty
14:32 | Fusl | it will only acknowledge a write back to the client once all OSDs have that write acknowledged
14:33 | Kenshin | so client writes to primary OSD, primary OSD writes to the other 2?
14:33 | Fusl | technically it *is* 100% RAID0 for reads, just that instead of 128KB striping in an mdadm raid, its the object size that you stripe over all the different PGs and OSDs with
14:33 | Fusl | Kenshin: yep
14:33 | Kenshin | so client network is still 1:1
14:33 | Fusl | yes
14:33 | Kenshin | so with your example, there are 4*8=32 OSDs
14:34 | Kenshin | with a 3 copy setting
14:34 | Kenshin | meaning 1 primary OSD per 3?
14:34 | Fusl | yes
14:34 | Fusl | there is always one primary OSD for each PG
14:34 | Kenshin | so read only goes to the primary OSD for that specific read request
14:34 | Kenshin | but it's distributed across multiple PGs
14:34 | Fusl | yes but no
14:35 | Kenshin | so chances are, it'll still hit the other PGs?
14:35 | Fusl | rbd_balance_parent_reads
14:35 | Kenshin | i think i got confused, lol
14:35 | Fusl | Description
14:35 | Fusl | Ceph typically reads objects from the primary OSD. Since reads are immutable, you may enable this feature to balance parent reads between the primary OSD and the replicas.
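[Editor's note: that description is of the client-side `rbd_balance_parent_reads` option; a minimal sketch of enabling it in ceph.conf.]

    # /etc/ceph/ceph.conf on the RBD clients
    [client]
    rbd balance parent reads = true   # balance parent reads between the primary OSD and the replicas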
14:35 | Kenshin | so to get like godly read speeds, we can do that?
14:35 | Fusl | yes
14:35 | Fusl | i recommend running at least ceph luminous for that though
14:36 | Fusl | any older version below that was known to corrupt data
14:36 | Fusl | :P
14:36 | Kenshin | does ceph deal with data integrity?
14:36 | Fusl | since bluestore, yes
14:36 | Kenshin | since that's something ZFS is supposedly very good at
14:36 | Fusl | it checksums read data and it also does a background scrub
14:37 | Kenshin | ic
14:37 | Fusl | and if checksums are wrong, it will automatically re-replicate the data
14:37 | Kenshin | how would it know who has the correct data though
14:37 | Fusl | to counteract bitrot
14:37 | Kenshin | since there are 3 copies
14:37 | Fusl | it stores the checksum in a table
14:37 | Kenshin | ah ok
14:37 | Fusl | each OSD stores its own data checksum
14:37 | Fusl | and if that checksum differs from what it has stored
14:38 | Fusl | it will ask the other OSDs for that checksum, if any other OSD has that checksum, it will copy that data over
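[Editor's note: manual triggers for the background scrubbing described above; the OSD and PG IDs are placeholders.]

    ceph osd scrub 3            # light scrub of all PGs on one OSD (metadata-level checks)
    ceph pg deep-scrub 2.1f     # deep scrub one PG: re-read objects and verify checksums
    ceph pg repair 2.1f         # repair an inconsistent PG from a healthy copy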
14:38 | Kenshin | ok great. i think it all makes sense now
14:38 | Kenshin | back to something you mentioned earlier, why ssd + hdd in the same pool but turn off caching?
14:39 | Fusl | and if no other OSD has that data, you'll have to do a manual recovery and god help you if you ever end up having to do that
14:39 | Fusl | cache tiering
14:39 | Fusl | only makes sense for REALLY fast, small SSDs
14:39 | Fusl | like DC enterprise nvme SSDs
14:40 | Kenshin | but say i want to sell VMs with both HDD and SSD volumes
14:40 | Fusl | heh
14:40 | Fusl | this is where it gets fun
14:40 | Fusl | so
14:40 | Fusl | ceph defines whats called
14:40 | Fusl | pools
14:40 | Fusl | and each pool can have a different CRUSH rule
14:40 | Fusl | so in my crush example that i pasted earlier
14:40 | Fusl | http://xor.meo.ws/e759Hlzp31ohMQzKGfAl3Rc8Rrq9oBnv.txt
14:40 | Fusl | see how it has several "rule" blocks at the end?
14:41 | Fusl | thats a crush rule
14:41 | Fusl | and if you create or modify a ceph pool, you can set the crush rule that it has to use to replicate its own placement groups
14:41 | Fusl | because placement groups in ceph are bound to pools
14:41 | Fusl | so you can have a mix of HDDs and SSDs
14:41 | Fusl | tag them differently
14:42 | Fusl | with the class
14:42 | Fusl | create a rule called "hdd", define that it should only use "hdd" class-tagged OSDs
14:42 | Fusl | create another rule called "ssd", define that it should only use "ssd" class-tagged OSDs
14:42 | Fusl | then create two different pools, one uses the hdd and the other one the ssd crush rule
14:42 | Fusl | and then you just add those as two different pools in proxmox
14:42 | Fusl | as two different storage backends
14:42 | Fusl | that way you can select between SSD and HDD
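[Editor's note: what that looks like with the device-class CRUSH rule commands; rule and pool names are placeholders.]

    # one replicated rule per device class, with host as the failure domain:
    ceph osd crush rule create-replicated rule-ssd default host ssd
    ceph osd crush rule create-replicated rule-hdd default host hdd

    # one pool per rule; each is then added to Proxmox as its own RBD storage backend:
    ceph osd pool create vm-ssd 128 128 replicated rule-ssd
    ceph osd pool create vm-hdd 128 128 replicated rule-hdd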
14:43 | Kenshin | right. so still technically 2 pools
14:43 | Kenshin | but using rules to define
14:43 | Fusl | and you can also have a third pool that says use ssds AND hdds
14:43 | Kenshin | so tiered storage?
14:43 | Fusl | nope
14:43 | Fusl | just by specifying a third crush rule
14:43 | Fusl | that says, use all OSDs
14:44 | Fusl | like
14:44 | Kenshin | "use whatever free space available i don't care"?
14:44 | Fusl | dont limit what OSD class to use
14:44 | Fusl | correct
14:44 | Fusl | and you can still have cache tiering on top of that
14:44 | Fusl | like, creating a 4th pool that says, use all SSDs
14:44 | Fusl | and a 5th pool that says, use all HDDs and use the 4th pool as cache tier
14:45 | Kenshin | does the cache tiering really work though?
14:45 | Fusl | yes
14:45 | Fusl | because you define a pool overlay
14:45 | Kenshin | but it allocates 200% of space? 100% of each?
14:45 | Fusl | and also define how the data is overlayed between the pools
14:45 | Fusl | nope
14:45 | Fusl | you can tell it to drop cold data after a while from the 4th to the 5th pool
14:45 | Fusl | http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
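[Editor's note: the core commands from that cache-tiering page; the "hot-ssd"/"cold-hdd" pool names are placeholders.]

    ceph osd tier add cold-hdd hot-ssd             # attach the SSD pool as a tier of the HDD pool
    ceph osd tier cache-mode hot-ssd writeback     # absorb writes in the hot tier
    ceph osd tier set-overlay cold-hdd hot-ssd     # route client I/O through the hot tier
    ceph osd pool set hot-ssd hit_set_type bloom   # track object hits so cold data can be flushed/evicted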
14:47 | Kenshin | i assume you've tested it? what kind of use case would you recommend?
14:47 | Fusl | i have tested it and
14:47 | Fusl | corrupted my data
14:48 | Kenshin | lol
14:48 | Kenshin | ok, don't use.
14:48 | Fusl | i am still not sure if i was just way too tired
14:48 | Kenshin | i can't really think of a use case based on VMs
14:48 | Fusl | /topic Today's agenda: Ceph crush-course ¯\_(ツ)_/¯
14:48 | Kenshin | it just seems like stuff will get shuffled between HDD-SSD way too much
14:49 | Kenshin | and it's not like it's file based, it's block based
14:49 | Kenshin | which may generally make no sense at all
14:49 | Fusl | things that are either read or write heavy or both would benefit from it
14:49 | Fusl | but only if they keep hitting the same objects
14:50 | Fusl | like, databases
14:50 | Fusl | or websites
14:50 | Kenshin | yeah the thing with databases is that mysql would write to the same file, but when the filesystem converts that to blocks it may not be the exact same block
14:50 | Kenshin | that's my thinking where things would likely become really messy
14:51 | Fusl | yep
14:51 | Kenshin | if it's file storage, maybe S3 -> CEPH then it would make a lot of sense
14:52 | Fusl | rados/S3 or cephfs would be another candidate, yeah
14:52 | Kenshin | did you have the chance to test PCIE based SSDs with this?
14:52 | Fusl | for caching?
14:53 | Kenshin | cause my nodes are all 3.5" hdd slots, putting 2.5" ssds seems like a complete waste
14:53 | Kenshin | for everything
14:53 | Kenshin | i need to buy new SSD/HDD for this project anyway
14:53 | Fusl | i dont have any pcie-based ssds but from what i heard, the performance is pretty good
14:53 | Kenshin | the stuff i have in stock are all 256GB SSD or 1/2TB HDD due to dedi servers
14:55 | Kenshin | instead of wasting 4 slots for SSD, might as well slap in a nice big PCIE SSD
14:55 | Fusl | bcache is also another thing
14:56 | Fusl | like, have bcache use nvme for caching and hdds for the cold storage devices
14:57 | Fusl | and then point ceph to use the virtual bcache devices
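[Editor's note: a minimal sketch of that bcache-under-ceph layering; device names are placeholders and the devices are assumed to be blank.]

    # pair one NVMe partition (cache) with each HDD (backing device):
    make-bcache -C /dev/nvme0n1p1 -B /dev/sda    # exposes /dev/bcache0
    make-bcache -C /dev/nvme0n1p2 -B /dev/sdb    # exposes /dev/bcache1

    # then build the OSDs on the bcache devices instead of the raw disks:
    ceph-volume lvm create --data /dev/bcache0
    ceph-volume lvm create --data /dev/bcache1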
14:57 | Kenshin | hmm, that sounds like an idea
14:57 | Kenshin | but that means i gotta partition the nvme (assuming qty < number of hdds) then attach them?
14:58 | Fusl | yep
14:58 | Kenshin | there's journals as well right
14:59 | Fusl | oh yes
14:59 | Fusl | you'll need to store them on nvme as well and then be careful about how you do that
15:00 | Fusl | but bluestore journals on hard drives are essentially really good by now
15:00 | Kenshin | so don't really need to do it on SSD
15:00 | Fusl | so you can just put the journal onto the HDDs
15:01 | Kenshin | less surprises
15:01 | Fusl | unless you want to run filestore for which you have absolutely no good reason to
15:01 | | jspiros is now known as jspiros_
15:15 | Kenshin | Fusl: if i use a PCI-E based 2TB SSD, would it be sufficient?
15:15 | Kenshin | or the better question, how many partitions do i need?
15:15 | Fusl | for journal or caching?
15:15 | Kenshin | data
15:15 | Kenshin | customer data
15:15 | Kenshin | journal like you said, use the OSD itself
15:15 | Fusl | yep
15:15 | Kenshin | caching is risky
15:16 | Fusl | 2TB sounds fine
15:16 | Kenshin | so just pure customer data
15:16 | Kenshin | intel P4600 is either 2TB or 4TB
15:16 | Kenshin | max speed 3200MB/sec read, 1575MB/sec write
15:16 | Fusl | if you wanna go with the larger one i dont see a reason why you shouldn't
15:16 | Kenshin | $$$
15:16 | Kenshin | need to buy 3 remember
15:16 | Kenshin | lol
15:16 | Kenshin | gets expensive
15:16 | Fusl | yeah but
15:16 | Fusl | more storage
15:16 | Fusl | :P
15:17 | Kenshin | question is whether i should bother with 2x2TB
15:19 | Fusl | i wouldnt
15:19 | Fusl | size/replication 2 is not recommended
15:19 | Fusl | neither is 1 obviously
15:21 | Fusl | do you mean 2x2TB per host?
15:21 | Fusl | so 3x2x2TB?
15:23 | Kenshin | 2 PCIE cards per host, each 2TB
15:23 | Kenshin | vs 1x 4TB
15:24 | Kenshin | i get more bandwidth definitely, but bandwidth isn't an issue
15:24 | Kenshin | since i'm stuck with 2x10G network
16:19 | | Zerote_ has joined #archiveteam-ot
16:25 | | chferfa has joined #archiveteam-ot
16:57 | | Zerote_ has quit IRC (Ping timeout: 263 seconds)
18:11 | | Zerote has joined #archiveteam-ot
18:52 | | jspiros__ has quit IRC ()
19:01 | | Stiletto has quit IRC (Ping timeout: 252 seconds)
19:04 | | Stiletto has joined #archiveteam-ot
19:08 | | Stiletto has quit IRC (Ping timeout: 246 seconds)
19:09 | | jspiros has joined #archiveteam-ot
19:12 | | Stiletto has joined #archiveteam-ot
19:17 | | Stiletto has quit IRC (Read error: Operation timed out)
20:29 | | t2t2 has quit IRC (Read error: Operation timed out)
20:29 | | t2t2 has joined #archiveteam-ot
21:51 | | ivan has quit IRC (Leaving)
21:52 | | ivan has joined #archiveteam-ot
21:55 | | ivan has quit IRC (Client Quit)
21:57 | | ivan has joined #archiveteam-ot
22:18 | | killsushi has joined #archiveteam-ot
22:31 | | BlueMax has joined #archiveteam-ot
23:30 | | m007a83_ has joined #archiveteam-ot
23:32 | | m007a83 has quit IRC (Ping timeout: 252 seconds)
23:53 | Flashfire | 10 rotations IPs at his caravan park