#archiveteam-ot 2019-04-20,Sat

Time Nickname Message
00:08 πŸ”— Madbrad has joined #archiveteam-ot
01:13 πŸ”— jspiros__ has quit IRC ()
01:14 πŸ”— jspiros__ has joined #archiveteam-ot
01:15 πŸ”— jspiros__ has quit IRC (Client Quit)
01:16 πŸ”— jspiros__ has joined #archiveteam-ot
01:22 πŸ”— killsushi has quit IRC (Quit: Leaving)
02:12 πŸ”— ayanami_ has joined #archiveteam-ot
02:21 πŸ”— BlueMax has quit IRC (Quit: Leaving)
02:49 πŸ”— Odd0002_ has joined #archiveteam-ot
02:50 πŸ”— Odd0002 has quit IRC (Read error: Operation timed out)
02:50 πŸ”— Odd0002_ is now known as Odd0002
02:54 πŸ”— bytefray has quit IRC (Read error: Connection reset by peer)
02:57 πŸ”— bytefray has joined #archiveteam-ot
02:58 πŸ”— paul2520 has quit IRC (Read error: Operation timed out)
03:04 πŸ”— dxrt_ has quit IRC (Read error: Connection reset by peer)
03:07 πŸ”— BlueMax has joined #archiveteam-ot
03:07 πŸ”— qw3rty113 has quit IRC (Ping timeout: 600 seconds)
03:08 πŸ”— paul2520 has joined #archiveteam-ot
03:08 πŸ”— step has quit IRC (Ping timeout: 600 seconds)
03:09 πŸ”— kiska1 has quit IRC (Ping timeout: 600 seconds)
03:10 πŸ”— qw3rty113 has joined #archiveteam-ot
03:11 πŸ”— step has joined #archiveteam-ot
03:12 πŸ”— kiska1 has joined #archiveteam-ot
03:12 πŸ”— Fusl sets mode: +o kiska1
03:12 πŸ”— qw3rty114 has joined #archiveteam-ot
03:13 πŸ”— dxrt_ has joined #archiveteam-ot
03:15 πŸ”— killsushi has joined #archiveteam-ot
03:20 πŸ”— qw3rty113 has quit IRC (Read error: Operation timed out)
03:21 πŸ”— qw3rty115 has joined #archiveteam-ot
03:25 πŸ”— step has quit IRC (Remote host closed the connection)
03:27 πŸ”— qw3rty114 has quit IRC (Read error: Operation timed out)
03:35 πŸ”— odemg has quit IRC (Ping timeout: 615 seconds)
03:42 πŸ”— odemg has joined #archiveteam-ot
04:08 πŸ”— ayanami_ has quit IRC (Quit: Leaving)
05:22 πŸ”— t3 ivan: Oh. Okay. Thanks.
05:23 πŸ”— t3 Kaz: Where did you get that graph?
05:43 πŸ”— dhyan_nat has joined #archiveteam-ot
06:06 πŸ”— JAA has quit IRC (Read error: Operation timed out)
06:06 πŸ”— cfarquhar has quit IRC (Read error: Operation timed out)
06:07 πŸ”— svchfoo1 has quit IRC (Read error: Operation timed out)
06:08 πŸ”— simon816 has quit IRC (Read error: Operation timed out)
06:12 πŸ”— lunik1 has quit IRC (Read error: Operation timed out)
07:08 πŸ”— cfarquhar has joined #archiveteam-ot
07:08 πŸ”— svchfoo1 has joined #archiveteam-ot
07:08 πŸ”— simon816 has joined #archiveteam-ot
07:08 πŸ”— lunik1 has joined #archiveteam-ot
07:10 πŸ”— JAA has joined #archiveteam-ot
07:11 πŸ”— Fusl sets mode: +o JAA
07:11 πŸ”— bakJAA sets mode: +o JAA
07:34 πŸ”— killsushi has quit IRC (Quit: Leaving)
07:47 πŸ”— VoynichCr https://www.presstv.com/Detail/2019/04/19/593779/Google-Youtube-presstv-hispantv-channel-close
08:11 πŸ”— Atom__ has joined #archiveteam-ot
08:18 πŸ”— Atom-- has quit IRC (Read error: Operation timed out)
08:41 πŸ”— antiufo has joined #archiveteam-ot
09:25 πŸ”— Fusl t3: archive team grafana https://atdash.meo.ws/, the graph shows upload speed into IA from one of Kaz's hosts
09:25 πŸ”— Fusl sorry if grammar wrong just woke up
10:24 πŸ”— VerifiedJ has joined #archiveteam-ot
10:24 πŸ”— Verified_ has quit IRC (Ping timeout: 252 seconds)
10:25 πŸ”— Verified_ has joined #archiveteam-ot
10:26 πŸ”— antiufo has quit IRC (Quit: WeeChat 2.3)
10:28 πŸ”— VerifiedJ has quit IRC (Ping timeout: 252 seconds)
11:01 πŸ”— Oddly has quit IRC (Ping timeout: 360 seconds)
11:12 πŸ”— bytefray has quit IRC (WeeChat 2.3)
11:17 πŸ”— Verified_ has quit IRC (Ping timeout: 252 seconds)
11:18 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
11:54 πŸ”— dhyan_nat has quit IRC (Read error: Operation timed out)
11:59 πŸ”— Verified_ has joined #archiveteam-ot
12:08 πŸ”— Tsuser has quit IRC (Ping timeout: 260 seconds)
12:09 πŸ”— benjins has joined #archiveteam-ot
13:20 πŸ”— Kenshin has joined #archiveteam-ot
13:20 πŸ”— Fusl sets mode: +o Kenshin
13:20 πŸ”— Kenshin Fusl: my guys are all traditional dedi/vps guys, no experience with openstack or ceph
13:21 πŸ”— Fusl so the problem with ceph is that you dont really want to run it with anything less than 3-5 nodes, as it will cause more performance bottlenecks than a standalone ZFS setup per node does
13:21 πŸ”— Kenshin we're looking at 3 nodes per "cluster" of sorts?
13:21 πŸ”— jspiros__ has quit IRC ()
13:21 πŸ”— Kenshin 3 copies of data
13:21 πŸ”— Kenshin but trying to figure out what kind of network backbone
13:21 πŸ”— Kenshin i tried out onapp storage for a bit in the past, hated it
13:22 πŸ”— Fusl running quad or dual 10gbit is what i would recommend
13:22 πŸ”— Fusl per node that is
13:22 πŸ”— Kenshin the other question is whether we need that kind of speed
13:22 πŸ”— Fusl doing dual 100gbit is what i do at home and it didnt increase the performance by a lot
13:23 πŸ”— Kenshin the physical servers we're using are 8 bay E5 single cores
13:23 πŸ”— Fusl well, you definitely do not want to go gigabit
13:23 πŸ”— Kenshin *single processors
13:24 πŸ”— Kenshin unless we do pure SSD, which is unlikely
13:24 πŸ”— Fusl hard drives? multiply the number of hard drives by 1gbit and you get the required network speed to run a stable cluster
13:24 πŸ”— Kenshin probably won't saturate a 2x10G
13:24 πŸ”— Kenshin we're thinking of ssd+hdd mix per server
13:24 πŸ”— Kenshin 2/6 or 4/4
13:25 πŸ”— Fusl for ssd's it's more like 4gbit per ssd
13:25 πŸ”— Fusl at least for sata
13:25 πŸ”— Kenshin more likely 4/4, high capacity ssd + high cap hdd
13:25 πŸ”— Fusl so 20gbit
13:25 πŸ”— Kenshin most of our customers are still traditional cpanel hosting or ecommerce
13:25 πŸ”— Fusl you can run dual 10gbit on that
13:26 πŸ”— Kenshin link bundle? or two vlans
13:26 πŸ”— Fusl and just let lacp layer 2+3+4 load balancing do the trick for you
13:26 πŸ”— Kenshin ok that sounds good
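
    A rough check of the numbers above plus a bonding sketch for the dual-10G recommendation. It reuses Fusl's rules of thumb (1 Gbit per HDD, ~4 Gbit per SATA SSD) applied to the 4 HDD + 4 SSD layout Kenshin mentions; the interface names and address are placeholders, and since Linux bonding only offers layer2+3 or layer3+4 hash policies, layer3+4 stands in for the "layer 2+3+4" shorthand:

        # 4 HDD x 1 Gbit + 4 SSD x 4 Gbit = 20 Gbit of backend bandwidth per node,
        # which a 2x10G LACP bond just about covers.
        #
        # /etc/network/interfaces sketch (Debian/Proxmox, ifupdown + ifenslave):
        auto bond0
        iface bond0 inet static
            address 10.0.0.11/24
            bond-slaves eno1 eno2
            bond-mode 802.3ad               # LACP
            bond-xmit-hash-policy layer3+4  # hash flows across both links
            bond-miimon 100
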
13:26 πŸ”— Kenshin compute nodes should also have 2x10G towards storage network right?
13:26 πŸ”— Fusl yes
13:27 πŸ”— Fusl just as a future note if you ever end up like i did
13:27 πŸ”— Kenshin then + 2x10G public internet facing
13:27 πŸ”— Fusl if you ever run a ceph cluster with more than around 10k nodes
13:27 πŸ”— Fusl split them up into a separate vlan/network and run a secondary cluster
13:27 πŸ”— Kenshin the heck? 10k?
13:28 πŸ”— Kenshin amazon?
13:28 πŸ”— Fusl nah, some private project i've been doing with a friend
13:28 πŸ”— Fusl single OSD per ceph host
13:28 πŸ”— Fusl ethernet/ceph drives
13:28 πŸ”— Kenshin so 10 drives in a server = 10 ceph nodes?
13:28 πŸ”— Fusl https://ceph.com/geen-categorie/500-osd-ceph-cluster/
13:29 πŸ”— Fusl 10 drives in a server = 10 OSDs in 1 host
13:29 πŸ”— Fusl each drive is called OSD
13:30 πŸ”— Kenshin ic
13:30 πŸ”— Kenshin so don't overdo the ceph nodes
13:30 πŸ”— Kenshin gotcha
13:31 πŸ”— Kenshin it's a relatively small setup
13:31 πŸ”— Kenshin i got plenty of E3 microclouds (2bay), E5 single or dual procs with 8 bays
13:32 πŸ”— Fusl whats the GHz on those cpus?
13:32 πŸ”— Kenshin so plan is to convert some of these dedis into a proxmox + ceph cluster. probably just 2-3 racks worth at most
13:33 πŸ”— Kenshin E3-1230V3 or V5, so 3.4GHz x4, E5 2620 V3/V4
13:33 πŸ”— Fusl if you're running standalone ceph clusters segregated from the proxmox clusters, disable hyperthreading, vtx and vtd, that will give you at least 30% performance increase, at least the hyperthreading part
13:33 πŸ”— Kenshin yeah separated, we have some units that are only single proc so 8 cores, planning to reuse them for pure ceph storage
13:34 πŸ”— Fusl yeah that sounds good
13:34 πŸ”— Kenshin the dual procs will be used for compute, as well as E3s for high Ghz compute
13:34 πŸ”— Fusl how much memory does each node have?
13:34 πŸ”— Kenshin E3s are stuck at 32G or 64G max, depending on DDR3/4
13:34 πŸ”— Fusl you'll see ceph eat around 2gb memory for a HDD and 4gb memory for an SSD OSD
13:34 πŸ”— Kenshin oh ceph, hmm
13:35 πŸ”— Kenshin if we did 4x 2TB SSD + 4x 10TB HDD
13:35 πŸ”— Fusl you can cut that down to around 1.5GB per OSD tho
13:35 πŸ”— Kenshin what are we looking at?
13:35 πŸ”— Fusl around 28ish gb memory usage
13:35 πŸ”— Kenshin so 32G should be safe
13:35 πŸ”— Fusl yep
13:36 πŸ”— Kenshin ok cool, thanks
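
    A minimal ceph.conf sketch of the memory cap Fusl mentions; osd_memory_target is the BlueStore knob for this (available from roughly Luminous 12.2.9 onward, so check the version being deployed), and the value is just the ~1.5 GB/OSD figure from above:

        [osd]
            # cap BlueStore cache + overhead at roughly 1.5 GiB per OSD daemon
            osd memory target = 1610612736
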
13:36 πŸ”— Fusl and then, you'll see yourself play around with `rbd cache` on the proxmox side ceph.conf a lot
13:36 πŸ”— Kenshin noob question but, how does scaling work?
13:36 πŸ”— Kenshin add 3 more ceph nodes when we need more space?
13:37 πŸ”— Fusl yeah, adding more OSDs
13:37 πŸ”— Fusl they dont even have to be the same size
13:37 πŸ”— Fusl thats the good thing about ceph, it will technically eat everything that you throw at it
13:37 πŸ”— Kenshin what about balancing?
13:37 πŸ”— Fusl it will automatically balance all objects around so they are equally distributed based on the size of the drives
13:38 πŸ”— Fusl proxmox-side ceph.conf rbd stuff: http://xor.meo.ws/BgPBAf5FZztBkJPKrG5pMQ60hXRlVEYs.txt
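
    Fusl's paste isn't reproduced in the log, but the `rbd cache` options it refers to live in the [client] section of the ceph.conf that Proxmox/librbd reads; a generic sketch with illustrative values (not Fusl's actual settings):

        [client]
            rbd cache = true
            rbd cache writethrough until flush = true   # stay safe until the guest issues its first flush
            rbd cache size = 67108864                   # 64 MiB write-back cache per VM disk
            rbd cache max dirty = 50331648              # 48 MiB of dirty data before writeback kicks in
            rbd cache target dirty = 33554432           # 32 MiB steady-state dirty target
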
13:38 πŸ”— Fusl as for your SSD/HDD mixin
13:38 πŸ”— Kenshin so assuming 3 hosts, 4 ssd, using ssd storage only. when we spin up an instance does it only use 1 drive per host?
13:39 πŸ”— Fusl throw both into the same pool, don't run pool caching
13:39 πŸ”— Fusl then go ahead and enable osd primary affinity on all ceph.conf ends
13:39 πŸ”— Fusl then set HDD primary affinity to 0
13:39 πŸ”— Fusl this will cause all OSD reads to happen from the SSDs rather than from the HDDs
13:40 πŸ”— Fusl and it will make your SSDs the primary OSD for all your objects
13:40 πŸ”— Fusl You must enable 'mon osd allow primary affinity = true' on the mons before you can adjust primary-affinity. Note that older clients will no longer be able to communicate with the cluster.
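
    Putting the last few lines together as commands, assuming the HDD-backed OSDs are osd.4-osd.7 (placeholder ids). The mon option is the one quoted above, and the affinity weight runs from 0 (never primary) to 1:

        # ceph.conf on the mons (restart or inject the option so it takes effect):
        [mon]
            mon osd allow primary affinity = true

        # demote the HDD OSDs so the SSD copies serve the reads:
        for osd in 4 5 6 7; do ceph osd primary-affinity osd.$osd 0; done
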
13:41 πŸ”— Fusl > when we spin up an instance does it only use 1 drive per host?
13:41 πŸ”— Fusl can you elaborate on that question?
13:42 πŸ”— Kenshin does it "raid0" across all available OSD on the host
13:42 πŸ”— Kenshin sorry, my ceph knowledge is very minimal
13:42 πŸ”— Fusl that depends how you configure it
13:42 πŸ”— Fusl so a normal, sane setup would be to set the RBD block size to 4MB and the replication size to 3
13:42 πŸ”— Kenshin my thinking is that it sounds like RAID1 over 3 physical nodes
13:42 πŸ”— Fusl that will cause all your blocks to be written three times to three different OSD
13:42 πŸ”— Kenshin but whether there's RAID0 within the host, no idea
13:43 πŸ”— Fusl so that RBD 1MB block size i mentioned earlier
13:43 πŸ”— Fusl is essentially the size that your RBD will be sliced up into chunks
13:43 πŸ”— Fusl or "objects"
13:44 πŸ”— Fusl because thats what they are in ceph
13:44 πŸ”— Fusl "objects"
13:44 πŸ”— Fusl so you have a 1MB object, that object lives distributed across three different OSDs, each on its own host
13:44 πŸ”— Fusl if you configure it correctly, ceph will ensure that no more than one copy of the same block will live on the same host
13:44 πŸ”— Fusl but it will live on a random OSD on that host
13:44 πŸ”— Fusl that's what CRUSH map is for
13:45 πŸ”— Kenshin ic, that makes sense
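
    A sketch of the "sane setup" Fusl describes: a 3-copy replicated pool and an RBD image chunked into 4 MiB objects. Pool name, PG count and image size are placeholders; `application enable` is needed on Luminous and later, and exact rbd flags vary slightly by release:

        ceph osd pool create vm-pool 128 128 replicated
        ceph osd pool set vm-pool size 3        # three copies of every object
        ceph osd pool set vm-pool min_size 2    # keep accepting writes with one copy down
        ceph osd pool application enable vm-pool rbd
        rbd create vm-pool/test-disk --size 10240 --object-size 4M   # 10 GiB image (size unit is MiB)
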
13:45 πŸ”— Kenshin but god if ceph's database is fucked
13:45 πŸ”— Kenshin the whole thing collapses
13:45 πŸ”— Fusl it's code (ill give you an example shortly) that describes how your objects are distributed in the cluster
13:45 πŸ”— Fusl there's no "database"
13:45 πŸ”— Fusl its all just CRUSH
13:45 πŸ”— Fusl so each host in the ceph cluster, each monitor, each manager, admin, etc.
13:46 πŸ”— Fusl every client
13:46 πŸ”— Fusl even the proxmox clients
13:46 πŸ”— Fusl see the exact same CRUSH map
13:46 πŸ”— Fusl and that CRUSH map is a hash calculation algorithm that tells the client where it has to store that data and how it distributes that across everything
13:47 πŸ”— Kenshin and that map is stored somewhere? or dynamically generated?
13:47 πŸ”— Fusl http://xor.meo.ws/e759Hlzp31ohMQzKGfAl3Rc8Rrq9oBnv.txt example crush map on one of my clusters
13:47 πŸ”— Fusl this is the CRUSH map ^
13:47 πŸ”— Fusl its stored on the monitor servers
13:47 πŸ”— Fusl you get that map
13:47 πŸ”— Fusl you modify it
13:47 πŸ”— Fusl and then you push that map into the cluster (the monitors) again
13:47 πŸ”— Fusl each client always connects to the monitor servers first
13:47 πŸ”— Fusl to figure out what the CRUSH map looks like
13:47 πŸ”— Fusl and where the OSDs live, ip address, etc.
13:48 πŸ”— Fusl and once thats done, the clients will connect to the OSDs whenever they need to
13:48 πŸ”— Fusl and when that crush map changes
13:48 πŸ”— Fusl for example when one of your OSDs gets offline
13:48 πŸ”— Fusl or an entire host goes offline
13:49 πŸ”— Fusl the managers will coordinate generating a temporary crush map that resembles a new map based on your static map but with the down OSDs removed from the calculations
13:49 πŸ”— Fusl so that your clients always have a way to put the data somewhere, at least temporary
13:49 πŸ”— Fusl so the monitors are the coordinators of the entire cluster
13:49 πŸ”— Fusl run them on SSDs
13:50 πŸ”— Fusl but run them on raid1 SSDs
13:50 πŸ”— Fusl they dont need to be large
13:50 πŸ”— Fusl 16gb is everything they need
13:50 πŸ”— Fusl but they need to be fast
13:50 πŸ”— Fusl because they will do all the magic when something breaks or when you do maintenance
13:50 πŸ”— Fusl and they like to live in a consensus
13:50 πŸ”— Fusl so always have an uneven amount of monitors
13:50 πŸ”— Fusl 3,5,7,9
13:51 πŸ”— Fusl you are technically fine if you run the monitors on the same hosts where the OSDs live but they need to have dedicated SSDs
13:51 πŸ”— Fusl anyways, i'm afk for 5 mins, ask away and ill answer them when im back
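
    The get/modify/push cycle Fusl describes, spelled out with the standard tools (file names are placeholders):

        ceph osd getcrushmap -o crush.bin        # fetch the compiled map from the monitors
        crushtool -d crush.bin -o crush.txt      # decompile into the editable text form
        $EDITOR crush.txt                        # adjust buckets/rules
        crushtool -c crush.txt -o crush-new.bin  # recompile
        ceph osd setcrushmap -i crush-new.bin    # push it back into the cluster
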
14:02 πŸ”— Kaz sounds easier just to raid0 your production cluster /s
14:02 πŸ”— Fusl map /dev/null, easiest
14:02 πŸ”— Fusl and very good performance
14:02 πŸ”— Fusl unbeatable
14:10 πŸ”— Kenshin ok so a bit of reading up done, CRUSH is basically like a map of the entire cluster?
14:10 πŸ”— Kenshin or at least where data is being stored
14:11 πŸ”— Fusl if you wanna see it as that yes
14:11 πŸ”— Fusl The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
14:11 πŸ”— Fusl CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster. For a detailed discussion of CRUSH, see CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data
14:11 πŸ”— Fusl more on that: http://docs.ceph.com/docs/master/rados/operations/crush-map/
14:12 πŸ”— Kenshin right. so that will point the client to the correct nodes+osd right?
14:12 πŸ”— Fusl for some things, the ceph documentation is REALLY worth a good read
14:12 πŸ”— Fusl yes
14:12 πŸ”— Kenshin thus bypassing a proxy and going direct to source
14:12 πŸ”— Fusl exactly
14:12 πŸ”— Kenshin but then say i have a 1GB data file, and 1MB block size
14:13 πŸ”— Kenshin i know it's stored in 3 nodes, across 4 OSDs each
14:13 πŸ”— Fusl so
14:13 πŸ”— Kenshin but the OSDs also store other data
14:13 πŸ”— Fusl see it as that
14:13 πŸ”— Kenshin where's the data map stored? on the OSD itself?
14:13 πŸ”— Fusl the 1gb volume
14:14 πŸ”— Fusl will be sliced up into 1024 equal sized 1mb chunks. objects.
14:14 πŸ”— Fusl all objects are distributed into several placement groups
14:14 πŸ”— Fusl placement groups are essentially buckets that hold millions of objects
14:15 πŸ”— Fusl and there are many placement groups, but there shouldn't be too many placement groups because they dont scale very well
14:15 πŸ”— Fusl placement groups are stored on the OSDs
14:16 πŸ”— Fusl thats how the stuff is distributed across all OSDs, by placement groups
14:16 πŸ”— Fusl all the objects within the same placement group always stay within that placement group
14:16 πŸ”— Fusl but the placement group is essentially what you replicate across different OSDs
14:17 πŸ”— Kenshin so PG is like partitioning?
14:17 πŸ”— Fusl kind of, yes
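
    For poking at that "partitioning" in practice, a few read-only commands (pool and object names are placeholders):

        ceph osd pool get vm-pool pg_num        # how many placement groups the pool carries
        ceph osd map vm-pool some-object-name   # which PG, and which OSD set, holds that object
        ceph pg dump pgs_brief                  # PG -> up/acting OSD sets across the cluster
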
14:17 πŸ”— Fusl so when one OSD goes down because the disk dies, ceph does not look up what objects it needs to move based on the OSD id
14:17 πŸ”— Fusl it looks up the placement groups that need to be moved
14:17 πŸ”— Fusl and then it just moves all objects within that placement group once
14:18 πŸ”— Kenshin self-healing?
14:18 πŸ”— Fusl yes
14:18 πŸ”— Kenshin this assumes one of the other OSDs on the same hosts has available space?
14:18 πŸ”— Fusl correct
14:18 πŸ”— Kenshin so PGs won't take up 100% of a volume
14:18 πŸ”— Fusl it will automatically shrink your available disk space down to whatever is available after that OSD died
14:19 πŸ”— Kenshin something like created as necessary
14:19 πŸ”— Fusl lets assume you have a 4 node cluster, each having 8 OSDs, no matter what kind of OSDs
14:20 πŸ”— Fusl and you have a placement group thats defined to be stored with size=3 (replicate/store the data 3 times)
14:20 πŸ”— Fusl if one node of that cluster goes down, 25% of your data is essentially degraded
14:20 πŸ”— Fusl it's still there, because for that 25% of the data you still have two more copies available
14:21 πŸ”— Fusl and thats where ceph will go in and automatically re-replicate all placement groups that have been lost
14:21 πŸ”— Kenshin right, in simpler terms, self-healing
14:21 πŸ”— Fusl and it will do so by balancing all the 25% of the lost placement groups to the rest of the 3 nodes
14:21 πŸ”— Fusl yep
14:21 πŸ”— Kenshin then when the node comes back up, auto rebalancing?
14:21 πŸ”— Fusl yes
14:21 πŸ”— Fusl when the node comes back
14:22 πŸ”— Fusl you might have the data still stored there
14:22 πŸ”— Kenshin but by assuming the node is empty/dirty
14:22 πŸ”— Kenshin overwrite everything?
14:22 πŸ”— Fusl it compares the version of the objects in all the placement groups
14:22 πŸ”— Fusl and then moves that data back into that node
14:22 πŸ”— Fusl by merging the new data onto the old one
14:22 πŸ”— Fusl and deleting old objects as necessary
14:22 πŸ”— Kenshin oh, nice
14:22 πŸ”— Kenshin so less data transferred
14:22 πŸ”— jspiros__ has joined #archiveteam-ot
14:22 πŸ”— Fusl yep
14:23 πŸ”— Kenshin how is data cleaning handled? if an object is deleted, old data is zeroized?
14:24 πŸ”— Fusl so up to a specific ceph version with bluestore OSDs, data is only unlinked from the disk and then later overwritten
14:24 πŸ”— Fusl newer versions support TRIMming of OSDs so that your data is actually deleted from on-disk
14:25 πŸ”— Fusl but as far as ceph disk utilization goes, if you delete objects, you're essentially freeing up the space
14:25 πŸ”— Kenshin and this also means that assuming the client network able to handle it (2x100G uplink for example), it's actually able to retrieve data from across multiple hosts to read data
14:25 πŸ”— Fusl correct
14:25 πŸ”— Kenshin since it's technically distributed RAID0 on the storage level
14:25 πŸ”— Kenshin assuming balancing is all done right
14:25 πŸ”— Fusl correct
14:25 πŸ”— Fusl there's just one pitfall
14:26 πŸ”— Fusl placement groups are replicated in a kind-of primary/secondary way
14:26 πŸ”— Fusl where there is one master OSD per placement group
14:26 πŸ”— Fusl and all the others are followers
14:26 πŸ”— Fusl so when your client goes ahead and reads an objects from a PG, it will always read from the primary OSD
14:27 πŸ”— Kenshin but write is x3?
14:27 πŸ”— Fusl yep
14:27 πŸ”— Kenshin and if for some reason, write fails to 1 of 3?
14:27 πŸ”— Fusl if it fails, the OSD will be kicked out of the cluster
14:28 πŸ”— Fusl that will trigger a restart of that OSD if possible or completely mark that OSD as down
14:28 πŸ”— Kenshin ic
14:28 πŸ”— Kenshin so write is expensive, network wise
14:28 πŸ”— Kenshin thus client needs to have sufficient bandwidth
14:28 πŸ”— Fusl yup
14:28 πŸ”— Fusl unless
14:28 πŸ”— Fusl erasure-coded pools :P
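
    For reference, a minimal erasure-coding sketch in Luminous-era syntax (profile and pool names are placeholders, and k=4 m=2 needs at least six hosts); note that RBD on an EC pool additionally needs allow_ec_overwrites plus a replicated pool for the image metadata:

        ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
        ceph osd pool create ec-data 128 128 erasure ec-4-2
        ceph osd pool set ec-data allow_ec_overwrites true   # required for RBD/CephFS on EC (BlueStore only)
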
14:29 πŸ”— Kaz going to chime in with a couple of q's: when master OSD fails, another just takes the master position? and is replication handled by the client, or the hosts?
14:29 πŸ”— * Kaz is learning things
14:30 πŸ”— Fusl if an OSD fails and happened to be a primary OSD for some PGs, the crush rule defines a new master by math
14:30 πŸ”— Fusl so "primary" OSDs aren't really defined anywhere, they are just there by being calculated as such in the CRUSH map
14:31 πŸ”— Fusl and if a PG's primary OSD goes down, the CRUSH map's calculations define in what order the secondary OSDs become master
14:31 πŸ”— Kenshin so this also means, it's technically not 100% raid0 because reads are not load balanced across all the OSDs, but N/3
14:31 πŸ”— Fusl also
14:32 πŸ”— Fusl data replication is not handled by the client itself
14:32 πŸ”— Fusl its handled by the primary OSD
14:32 πŸ”— Kaz as a background task or realtime?
14:32 πŸ”— Fusl realtime
14:32 πŸ”— Kaz ah ic
14:32 πŸ”— Kaz ty
14:32 πŸ”— Fusl it will only acknowledge a write back to the client once all OSDs have that write acknowledged
14:33 πŸ”— Kenshin so client writes to primary OSD, primary OSD write to the other 2?
14:33 πŸ”— Fusl technically it *is* 100% RAID0 for reads, just that instead of 128KB striping in an mdadm raid, its the object size that you stripe over all the different PGs and OSDs with
14:33 πŸ”— Fusl Kenshin: yep
14:33 πŸ”— Kenshin so client network is still 1:1
14:33 πŸ”— Fusl yes
14:33 πŸ”— Kenshin so with your example, there are 4*8=32 OSDs
14:34 πŸ”— Kenshin with a 3 copy setting
14:34 πŸ”— Kenshin meaning 1 primary OSD per 3?
14:34 πŸ”— Fusl yes
14:34 πŸ”— Fusl there is always one primary OSD for each PG
14:34 πŸ”— Kenshin so read only goes to the primary OSD for that specific read request
14:34 πŸ”— Kenshin but it's distributed across multiple PG
14:34 πŸ”— Fusl yes but no
14:34 πŸ”— Kenshin so chances are, it'll still hit the other PGs?
14:35 πŸ”— Fusl rbd_balance_parent_reads
14:35 πŸ”— Kenshin i think i got confused, lol
14:35 πŸ”— Fusl Description
14:35 πŸ”— Fusl Ceph typically reads objects from the primary OSD. Since reads are immutable, you may enable this feature to balance parent reads between the primary OSD and the replicas.
14:35 πŸ”— Kenshin so to get like godly read speeds, we can do that?
14:35 πŸ”— Fusl yes
14:35 πŸ”— Fusl i recommend running at least ceph luminous for that though
14:36 πŸ”— Fusl any older version below that was known to corrupt data
14:36 πŸ”— Fusl :P
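
    The option being quoted is a librbd client setting; a one-line ceph.conf sketch (it only balances reads served from a clone's parent image, so how much it helps depends on how the images are laid out):

        [client]
            rbd balance parent reads = true   # spread immutable parent reads across replicas instead of only the primary OSD
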
14:36 πŸ”— Kenshin does ceph deal with data integrity?
14:36 πŸ”— Fusl since bluestore, yes
14:36 πŸ”— Kenshin since that's something ZFS is supposedly very good in
14:36 πŸ”— Fusl it checksums read data and it also does a background scrub
14:37 πŸ”— Kenshin ic
14:37 πŸ”— Fusl and if checksums are wrong, it will automatically re-replicate the data
14:37 πŸ”— Kenshin how would it know who has the correct data though
14:37 πŸ”— Fusl to counteract against bitrot
14:37 πŸ”— Kenshin since there are 3 copies
14:37 πŸ”— Fusl it stores the checksum in a table
14:37 πŸ”— Kenshin ah ok
14:37 πŸ”— Fusl each OSD stores its own data checksum
14:37 πŸ”— Fusl and if that checksum differs from what it has stored
14:38 πŸ”— Fusl it will ask the other OSDs for that checksum, if any other OSD has that checksum, it will copy that data over
14:38 πŸ”— Kenshin ok great. i think it all makes sense now
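
    The checksum verification Fusl describes is what scrubbing does; it runs in the background on a schedule, but can also be driven by hand (the PG id is a placeholder):

        ceph pg deep-scrub 2.1f   # re-read and checksum every object in that PG
        ceph pg repair 2.1f       # rewrite bad copies from a good replica if a scrub found errors
        ceph -s                   # cluster health, including inconsistent PGs / scrub errors
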
14:38 πŸ”— Kenshin back to something you mentioned earlier, why ssd + hdd in the same pool but turn off caching?
14:38 πŸ”— Fusl and if no other OSD has that data, you'll have to do a manual recovery and god help you if you ever end up having to do that
14:39 πŸ”— Fusl cache tiering
14:39 πŸ”— Fusl makes only sense for REALLY fast, small SSDs
14:39 πŸ”— Fusl like DC enterprise nvme SSDs
14:39 πŸ”— Kenshin but say i want to sell VMs with both HDD and SSD volumes
14:40 πŸ”— Fusl heh
14:40 πŸ”— Fusl this is where it gets fun
14:40 πŸ”— Fusl so
14:40 πŸ”— Fusl ceph defines whats called
14:40 πŸ”— Fusl pools
14:40 πŸ”— Fusl and each pool can have a different CRUSH rule
14:40 πŸ”— Fusl so in my crush example that i pasted earlier
14:40 πŸ”— Fusl http://xor.meo.ws/e759Hlzp31ohMQzKGfAl3Rc8Rrq9oBnv.txt
14:40 πŸ”— Fusl see how it has several "rule" blocks at the end?
14:41 πŸ”— Fusl thats a crush rule
14:41 πŸ”— Fusl and if you create or modify a ceph pool, you can set the crush rule that it has to use to replicate its own placement groups
14:41 πŸ”— Fusl because placement groups in ceph are bound to pools
14:41 πŸ”— Fusl so you can have a mix of HDDs and SSDs
14:41 πŸ”— Fusl tag them differently
14:41 πŸ”— Fusl with the class
14:42 πŸ”— Fusl create a rule called "hdd", define that it should only use "hdd" class-tagged OSDs
14:42 πŸ”— Fusl create another rule called "ssd", define that it should only use "ssd" class-tagged OSDs
14:42 πŸ”— Fusl then create two different pools, one uses the hdd and the other one the ssd crush rule
14:42 πŸ”— Fusl and then you just add those as two different pools in proxmox
14:42 πŸ”— Fusl as two different storage backends
14:42 πŸ”— Fusl that way you can select between SSD and HDD
14:43 πŸ”— Kenshin right. so still technically 2 pools
14:43 πŸ”— Kenshin but using rules to define
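
    The class-based split written out as commands (rule and pool names and OSD ids are placeholders; recent Ceph usually assigns hdd/ssd device classes automatically, so the set-device-class lines may be unnecessary):

        ceph osd crush set-device-class ssd osd.0 osd.1
        ceph osd crush set-device-class hdd osd.2 osd.3
        ceph osd crush rule create-replicated ssd-rule default host ssd
        ceph osd crush rule create-replicated hdd-rule default host hdd
        ceph osd pool create fast-ssd 128 128 replicated ssd-rule
        ceph osd pool create slow-hdd 128 128 replicated hdd-rule
        # then add fast-ssd and slow-hdd as two separate RBD storages in Proxmox
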
14:43 πŸ”— Fusl and you can also have a third pool that says use ssds AND hdds
14:43 πŸ”— Kenshin so tiered storage?
14:43 πŸ”— Fusl nope
14:43 πŸ”— Fusl just by specifying a third crush rule
14:43 πŸ”— Fusl that says, use all OSDs
14:44 πŸ”— Fusl like
14:44 πŸ”— Kenshin "use whatever free space available i don't care"?
14:44 πŸ”— Fusl dont limit what OSD class to use
14:44 πŸ”— Fusl correct
14:44 πŸ”— Fusl and you can still have cache tiering on top of that
14:44 πŸ”— Fusl like, creating a 4th pool that says, use all SSDs
14:44 πŸ”— Fusl and a 5th pool that says, use all HDDs use the 4th pool as cache tier
14:45 πŸ”— Kenshin does the cache tiering really work though?
14:45 πŸ”— Fusl yes
14:45 πŸ”— Fusl because you define a pool overlay
14:45 πŸ”— Kenshin but it allocates 200% of space? 100% of each?
14:45 πŸ”— Fusl and also define how the data is overlayed between the pools
14:45 πŸ”— Fusl nope
14:45 πŸ”— Fusl you can tell it to drop cold data after a while from the 4th to the 5th pool
14:45 πŸ”— Fusl http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
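
    The overlay setup from that docs page, condensed (pool names and thresholds are placeholders; as Fusl notes just below, treat cache tiering with caution):

        ceph osd tier add cold-hdd hot-ssd                  # attach hot-ssd as a tier of cold-hdd
        ceph osd tier cache-mode hot-ssd writeback
        ceph osd tier set-overlay cold-hdd hot-ssd          # clients now transparently hit the cache pool
        ceph osd pool set hot-ssd hit_set_type bloom
        ceph osd pool set hot-ssd target_max_bytes 1000000000000    # ~1 TB of hot objects
        ceph osd pool set hot-ssd cache_target_dirty_ratio 0.4
        ceph osd pool set hot-ssd cache_target_full_ratio 0.8
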
14:47 πŸ”— Kenshin i assume you've tested it? what kind of use case would you recommend?
14:47 πŸ”— Fusl i have tested it and
14:47 πŸ”— Fusl corrupted my data
14:48 πŸ”— Kenshin lol
14:48 πŸ”— Kenshin ok, don't use.
14:48 πŸ”— Fusl i am still not sure if i was just wayy too tired
14:48 πŸ”— Kenshin i can't really think of a use case based on VMs
14:48 πŸ”— Fusl /topic Today's agenda: Ceph crush-course ¯\_(ツ)_/¯
14:48 πŸ”— Kenshin it just seems like stuff will get shuffled between HDD-SSD way too much
14:49 πŸ”— Kenshin and it's not like it's file based, it's block based
14:49 πŸ”— Kenshin which may generally make no sense at all
14:49 πŸ”— Fusl things that are either read or write heavy or both would benefit from it
14:49 πŸ”— Fusl but only if they keep hitting the same objects
14:50 πŸ”— Fusl like, databases
14:50 πŸ”— Fusl or websites
14:50 πŸ”— Kenshin yeah the thing with databases is that mysql writes to the same file, but once the filesystem converts that to blocks it may not be the exact same block
14:50 πŸ”— Kenshin that's my thinking where things would likely become really messy
14:51 πŸ”— Fusl yep
14:51 πŸ”— Kenshin if it's file storage, maybe S3 -> CEPH then it would make a lot of sense
14:52 πŸ”— Fusl rados/S3 or cephfs would be another candidate, yeah
14:52 πŸ”— Kenshin did you have the chance to test PCIE based SSDs with this?
14:52 πŸ”— Fusl for caching?
14:53 πŸ”— Kenshin cause my nodes are all 3.5" hdd slots, putting 2.5" ssds seems like a complete waste
14:53 πŸ”— Kenshin for everything
14:53 πŸ”— Kenshin i need to buy new SSD/HDD for this project anyway
14:53 πŸ”— Fusl i dont have any pcie-based ssds but from what i heard, the performance is pretty good
14:53 πŸ”— Kenshin the stuff i have in stock are all 256GB SSD or 1/2TB HDD due to dedi servers
14:53 πŸ”— Kenshin instead of wasting 4 slots for SSD, might as well slap in a nice big PCIE SSD
14:55 πŸ”— Fusl bcache is also another thing
14:55 πŸ”— Fusl like, have bcache use nvme for caching and hdds for the cold storage devices
14:56 πŸ”— Fusl and then point ceph to use the virtual bcache devices
14:57 πŸ”— Kenshin hmm, that sounds like an idea
14:57 πŸ”— Kenshin but that means i gotta partition the nvme (assuming qty < number of hdds) then attach them?
14:58 πŸ”— Fusl yep
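
    A bcache sketch under that assumption (one NVMe partition per HDD; device paths and the cache-set UUID are placeholders). The resulting /dev/bcacheN device is what ceph would then be pointed at as the OSD device:

        apt install bcache-tools
        make-bcache -C /dev/nvme0n1p1                          # register the NVMe partition as a cache device
        make-bcache -B /dev/sdb                                # register the HDD as a backing device
        echo <cset-uuid> > /sys/block/bcache0/bcache/attach    # attach the backing device to the cache set
        echo writeback > /sys/block/bcache0/bcache/cache_mode  # cache writes as well as reads
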
14:58 πŸ”— Kenshin there's journals as well right
14:59 πŸ”— Fusl oh yes
14:59 πŸ”— Fusl you'll need to store them on nvme as well and then be careful about how you do that
15:00 πŸ”— Fusl but bluestore journals on hard drives is essentially really good by now
15:00 πŸ”— Kenshin so don't really need to do it on SSD
15:00 πŸ”— Fusl so you can just put the journal onto the HDDs
15:00 πŸ”— Kenshin less surprises
15:01 πŸ”— Fusl unless you want to run filestore for which you have absolutely no good reason to
15:01 πŸ”— jspiros is now known as jspiros_
15:15 πŸ”— Kenshin Fusl: if i use a PCI-E based 2TB SSD, would it be sufficient?
15:15 πŸ”— Kenshin or the better question, how many partitions do i need?
15:15 πŸ”— Fusl for journal or caching?
15:15 πŸ”— Kenshin data
15:15 πŸ”— Kenshin customer data
15:15 πŸ”— Kenshin journal like you said, use the OSD itself
15:15 πŸ”— Fusl yep
15:15 πŸ”— Kenshin caching is risky
15:16 πŸ”— Fusl 2TB sounds fine
15:16 πŸ”— Kenshin so just pure customer data
15:16 πŸ”— Kenshin intel P4600 is either 2TB or 4TB
15:16 πŸ”— Kenshin max speed 3200MB/sec read, 1575MB/sec write
15:16 πŸ”— Fusl if you wanna go with the larger one i dont see a reason why you shouldn't
15:16 πŸ”— Kenshin $$$
15:16 πŸ”— Kenshin need to buy 3 remember
15:16 πŸ”— Kenshin lol
15:16 πŸ”— Kenshin gets expensive
15:16 πŸ”— Fusl yeah but
15:16 πŸ”— Fusl more storage
15:16 πŸ”— Fusl :P
15:17 πŸ”— Kenshin question is whether i should bother with 2x2TB
15:19 πŸ”— Fusl i wouldnt
15:19 πŸ”— Fusl size/replication 2 is not recommended
15:19 πŸ”— Fusl neither is 1 obviously
15:21 πŸ”— Fusl do you mean 2x2TB per host?
15:21 πŸ”— Fusl so 3x2x2TB?
15:23 πŸ”— Kenshin 2 PCIE cards per host, each 2TB
15:23 πŸ”— Kenshin vs 1x 4TB
15:24 πŸ”— Kenshin i get more bandwidth definitely, but bandwidth isn't an issue
15:24 πŸ”— Kenshin since i'm stuck with 2x10G network
16:19 πŸ”— Zerote_ has joined #archiveteam-ot
16:25 πŸ”— chferfa has joined #archiveteam-ot
16:57 πŸ”— Zerote_ has quit IRC (Ping timeout: 263 seconds)
18:11 πŸ”— Zerote has joined #archiveteam-ot
18:52 πŸ”— jspiros__ has quit IRC ()
19:01 πŸ”— Stiletto has quit IRC (Ping timeout: 252 seconds)
19:04 πŸ”— Stiletto has joined #archiveteam-ot
19:08 πŸ”— Stiletto has quit IRC (Ping timeout: 246 seconds)
19:09 πŸ”— jspiros has joined #archiveteam-ot
19:12 πŸ”— Stiletto has joined #archiveteam-ot
19:17 πŸ”— Stiletto has quit IRC (Read error: Operation timed out)
20:29 πŸ”— t2t2 has quit IRC (Read error: Operation timed out)
20:29 πŸ”— t2t2 has joined #archiveteam-ot
21:51 πŸ”— ivan has quit IRC (Leaving)
21:52 πŸ”— ivan has joined #archiveteam-ot
21:55 πŸ”— ivan has quit IRC (Client Quit)
21:57 πŸ”— ivan has joined #archiveteam-ot
22:18 πŸ”— killsushi has joined #archiveteam-ot
22:31 πŸ”— BlueMax has joined #archiveteam-ot
23:30 πŸ”— m007a83_ has joined #archiveteam-ot
23:32 πŸ”— m007a83 has quit IRC (Ping timeout: 252 seconds)
23:53 πŸ”— Flashfire 10 rotating IPs at his caravan park
