  * [[https://yourcmc.ru/wiki/Ceph_performance]]
  * [[https://accelazh.github.io/ceph/Ceph-Performance-Tuning-Checklist|Ceph Performance Tuning Checklist]]
  * [[https://www.reddit.com/r/ceph/comments/zpk0wo/new_to_ceph_hdd_pool_is_extremely_slow/|New to Ceph, HDD pool is extremely slow]]
  * [[https://forum.proxmox.com/threads/ceph-storage-performance.129408/#post-566971|Ceph Storage Performance]]
  
===== Performance tips =====

Ceph is built for scale and works great in large clusters. In a small cluster, every node will be heavily loaded.

  * adapt the number of PGs to the number of OSDs to spread traffic evenly
  * use ''krbd''
  * enable ''writeback'' cache on VMs (possible data loss on consumer SSDs)

==== performance on small cluster ====

  * [[https://www.youtube.com/watch?v=LlLLJxNcVOY|Configuring Small Ceph Clusters for Optimal Performance - Josh Salomon, Red Hat]]
  * the number of PGs should be a power of 2 (or midway between two powers of 2)
  * same utilization (% full) per device
  * same number of PGs per OSD = same number of requests per device
  * same number of primary PGs per OSD = read operations spread evenly
    * the primary PG is the original/first PG; the others are replicas. Reads are served by the primary PG.
  * use relatively more PGs than on a big cluster - better balance, but handling PGs consumes resources (RAM)
    * e.g. for 7 OSDs x 2TB, the PG autoscaler recommends 256 PGs. After changing to 384, IOPS increased drastically and latency dropped. Setting 512 PGs was not possible because of the 250 PG per OSD limit (see the sketch below).
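
A minimal sketch of raising the PG count manually (the pool name ''rbd'' and the value 384 follow the example above; with the autoscaler off or in warn mode this is a manual step):

<code bash>
# check the current value first
ceph osd pool get rbd pg_num
# raise it; since Nautilus, pgp_num follows pg_num automatically
ceph osd pool set rbd pg_num 384
</code>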
  
=== balancer ===

<code bash>
ceph mgr module enable balancer
ceph balancer on
ceph balancer mode upmap
# verify with: ceph balancer status
</code>

=== CRUSH reweight ===

If possible, use the ''balancer'' instead. CRUSH reweighting (''ceph osd crush reweight'') overrides the default CRUSH weight assignment of an OSD.

==== block.db and block.wal ====

The DB stores BlueStore's internal metadata and the WAL is BlueStore's internal journal or write-ahead log. It is recommended to use a fast SSD or NVRAM for better performance.

Important: since Ceph has to write all data to the journal (or WAL+DB) before it can ACK writes, keeping this metadata and OSD performance in balance is really important!

For hosts with multiple HDDs (multiple OSDs), it is possible to use one SSD for the DB/WAL of all OSDs (one partition per OSD).

NOTE: the recommended mixed setup for one host is:
  * multiple HDDs (one OSD per HDD)
  * one fast SSD/NVMe drive for the DB/WAL (only about 30GB per 2TB HDD is needed)

The Proxmox UI and CLI expect a whole device as the DB device, not a partition. It will not destroy an existing drive: it expects an LVM volume group with free space and will create a new LVM volume for the DB/WAL there.

The native Ceph CLI can work with a partition specified as the DB device (it also works with a whole drive or an LVM volume).

**MORE INFO:**
  * https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
  * https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#adding-removing-osds
  * https://www.reddit.com/r/ceph/comments/jnyxgm/how_do_you_create_multiple_osds_per_disk_with/

==== DB/WAL sizes ====

  * If there is <1GB of fast storage, it is best used as WAL only (without a DB).
  * If a DB device is specified but an explicit WAL device is not, the WAL is implicitly colocated with the DB on the faster device.

DB size:
  * (still true for Octopus 15.2.6) the DB should be 30GB, and this does not depend on the size of the data partition
    * all block.db sizes except **4, 30, 286 GB** are pointless
      * see: [[https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing|About block.db sizing]]
      * [[https://github.com/facebook/rocksdb/wiki/Leveled-Compaction|Leveled Compaction]]
  * logical volumes should be as large as possible
  * for RGW (Rados Gateway) workloads: min. 4% of the block device size
  * for RBD (Rados Block Device) workloads: 1-2% is enough (2% of 2TB is 40GB)
  * according to ''ceph daemon osd.0 perf dump | jq .bluefs'', 80GB was reserved on the HDD for the DB, of which only 1.6-2.4GB is used

===== Adding journal DB/WAL partition =====

If an OSD needs to be shut down for maintenance (e.g. adding a new disc), set ''ceph osd set noout'' to prevent unnecessary data rebalancing.

==== Create partition on NVM drive ====

Reorganize the existing NVMe/SSD disc to make some free space, then create an empty partition in that space (see the sketch after the code block).

<code bash>
# remove the cache partition from the zpool
zpool list -v
zpool remove rpool /dev/nvme0n1p3
...
reorganize partitions
...
blkid
zpool add rpool cache 277455ae-1bfa-41f6-8b89-fd362d35515e
</code>
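
A sketch of creating the DB partition in the freed space with ''sgdisk'' (the device ''/dev/nvme0n1'', partition number 3 and the 40GB size are assumptions for this host):

<code bash>
# create a 40GB partition named osd-db in the free space
sgdisk --new=3:0:+40G --change-name=3:osd-db /dev/nvme0n1
# re-read the partition table and check the result
partprobe /dev/nvme0n1
lsblk /dev/nvme0n1
</code>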
  
==== Replace OSD ====

<code bash>
ceph osd tree

ceph device ls-by-host pve5
DEVICE                     DEV  DAEMONS  EXPECTED FAILURE
TOSHIBA_HDWD120_30HN40HAS  sdc  osd.2

### Switch OFF the OSD. Ceph should rebalance data from replicas when an OSD is switched off directly
ceph osd out X
## or better, use the lines below:
# this is optional, for safety on small clusters, instead of using ceph osd out X
ceph osd reweight osd.X 0
# wait for data migration away from osd.X
watch 'ceph -s; ceph osd df tree'
# remove the OSD
ceph osd out X

ceph osd safe-to-destroy osd.X
ceph osd down X
systemctl stop ceph-osd@X.service
ceph osd destroy X

# pveceph osd destroy X

# to remove the partition table, boot sector and any OSD leftovers:
ceph-volume lvm zap /dev/sdX --destroy

## it is not possible to specify a DB partition with the pveceph command (see the beginning of this page):
# pveceph osd create /dev/sdc --db_dev /dev/nvme0n1p3
## it requires a whole device (with LVM) as the DB device and will create a new LVM volume on the free space, i.e.
# pveceph osd create /dev/sdc --db_dev /dev/nvme0n1 --db_size 32G
## so the direct ceph command will be used:

# prevent backfilling while the new OSD is being added
ceph osd set nobackfill

### Create the OSD:
ceph-volume lvm create --osd-id X --bluestore --data /dev/sdc --block.db /dev/nvme0n1p3
# or split the above into two steps:
ceph-volume lvm prepare --bluestore --data /dev/sdX --block.db /dev/nvme0n1pX
ceph-volume lvm activate --bluestore X e56ecc53-826d-40b0-a647-xxxxxxxxxxxx
# also possible: ceph-volume lvm activate --all

## DRAFTS:
# ceph-volume lvm create --cluster-fsid 321bdc94-39a5-460a-834f-6e617fdd6c66 --data /dev/sdc --block.db /dev/nvme0n1p3
# ceph-volume lvm activate --bluestore <osd id> <osd fsid>
</code>

Verify:
  
<code bash>
ls -l /var/lib/ceph/osd/ceph-X/
lrwxrwxrwx 1 ceph ceph  93 sty 28 17:59 block -> /dev/ceph-16a69325-6fb3-4d09-84ee-c053c01f410f/osd-block-e56ecc53-826d-40b0-a647-5f1a1fc8800e
lrwxrwxrwx 1 ceph ceph  14 sty 28 17:59 block.db -> /dev/nvme0n1p3

ceph daemon osd.X perf dump | jq '.bluefs'
{
  "gift_bytes": 0,
  "reclaim_bytes": 0,
  "db_total_bytes": 42949664768,    --> 39.99GB (40GB partition created)
  "db_used_bytes": 1452269568,      -->  1.35GB
  "wal_total_bytes": 0,
  "wal_used_bytes": 0,
...
OR:
  "db_total_bytes": 4294959104,     --> 3.9GB (4GB partition)
  "db_used_bytes": 66052096,
...

ceph device ls
</code>

And restore backfilling:

<code bash>
ceph osd unset nobackfill
</code>

Check the benefits:
  * Observe better latency on OSDs with an NVMe/SSD DB: <code bash>watch ceph osd perf</code>
  * Check ''iotop'' output: ''[bstore_kv_sync]'' should now take less time.

==== PG autoscaler ====

It is better to run it in warn mode, so that it does not put unexpected load on the cluster when the PG number changes.

<code bash>
ceph mgr module enable pg_autoscaler
# ceph osd pool set <pool> pg_autoscale_mode <mode>
ceph osd pool set rbd pg_autoscale_mode warn
</code>

It is possible to set a desired/target size for a pool. This prevents the autoscaler from moving data around every time new data is stored.
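
For example (a sketch: the pool name and values are placeholders; ''target_size_ratio'' and ''target_size_bytes'' are the standard pool properties the autoscaler reads):

<code bash>
# tell the autoscaler the pool is expected to use ~60% of the cluster
ceph osd pool set rbd target_size_ratio 0.6
# or as an absolute expected size:
# ceph osd pool set rbd target_size_bytes 2T
</code>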

==== check cluster balance ====

  * ''ceph -s''
  * ''ceph osd df'' shows the standard deviation

There is no tool that shows primary PG balancing; a helper script is available at [[https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh]].

==== Issues ====

=== auth: unable to find a keyring ===

It is not possible to create a Ceph OSD, either from the WebUI or from the command line: <code bash>pveceph osd create /dev/sdc</code>

<code>
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-2/activate.monmap
 stderr: 2021-01-28T10:21:24.996+0100 7fd1a848f700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2021-01-28T10:21:24.996+0100 7fd1a848f700 -1 AuthRegistry(0x7fd1a0059030) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
</code>

The relevant ceph.conf section:

<file init /etc/pve/ceph.conf>
[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
         keyring = /var/lib/ceph/mds/ceph-$id/keyring
</file>

ceph.conf variables:
  * **$cluster** - the cluster name. For Proxmox it is ''ceph''
  * **$type** - the daemon type: ''mds'', ''osd'', ''mon''
  * **$id** - the daemon or client identifier. For ''osd.0'' it is ''0''
  * **$host** - the hostname where the process is running
  * **$name** - expands to $type.$id, e.g. ''osd.2'' or ''client.bootstrap''
  * **$pid** - expands to the daemon PID

**SOLUTION:**
<code bash>cp /var/lib/ceph/bootstrap-osd/ceph.keyring /etc/pve/priv/ceph.client.bootstrap-osd.keyring</code>
Alternative to try: change ceph.conf.
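
Either way, to verify the fix (a sketch: the keyring path comes from the error message above, and retrying the creation is the actual test):

<code bash>
# the bootstrap keyring should now exist at the expected path
ls -l /etc/pve/priv/ceph.client.bootstrap-osd.keyring
# retry OSD creation
pveceph osd create /dev/sdc
</code>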
=== Unit -.mount is masked ===

<code>
Running command: /usr/bin/systemctl start ceph-osd@2
 stderr: Failed to start ceph-osd@2.service: Unit -.mount is masked.
-->  RuntimeError: command returned non-zero exit status: 1
</code>

It was caused by ''gparted'' which wasn't shut down correctly.
  * [[https://askubuntu.com/questions/1191596/unit-mount-is-masked|Unit -.mount is masked]]
  * [[https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=948739|gparted should not mask .mount units]]
  * [[https://unix.stackexchange.com/questions/533933/systemd-cant-unmask-root-mount-mount/548996]]
  
**Solution:**

<code bash>systemctl --runtime unmask -- -.mount</code>

To list runtime-masked units:
<code bash>ls -l /var/run/systemd/system | grep mount | grep '/dev/null' | cut -d ' ' -f 11</code>

To unescape systemd unit names:
<code bash>systemd-escape -u 'rpool-data-basevol\x2d800\x2ddisk\x2d0.mount'</code>
This should print something like ''rpool/data/basevol-800-disk-0.mount''.
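
A sketch combining the two commands above (unmask every runtime-masked mount unit in one go; the listing pipeline is reused verbatim):

<code bash>
for u in $(ls -l /var/run/systemd/system | grep mount | grep '/dev/null' | cut -d ' ' -f 11); do
  # unmask only the runtime mask, as in the solution above
  systemctl --runtime unmask -- "$u"
done
</code>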
  
==== performance on slow HDDs ====