====== Adding a DB ======
NOTE: if the partition is formatted as an LVM PV, it can be selected as the DB device in the Proxmox UI when creating an OSD:
pvcreate /dev/sda5
vgcreate ceph-db-40gb /dev/sda5
{{:vm:proxmox:ceph:db:pasted:20240214-080040.png}}
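Alternatively, from the CLI, ''ceph-volume'' accepts a logical volume as the ''--block.db'' device, so a dedicated LV can be created in the VG above and used directly. A minimal sketch; the LV name ''db-sdc'' is just an example:
# carve a 40G LV for the DB out of the VG created above (name is only an example)
lvcreate -L 40G -n db-sdc ceph-db-40gb
# create the OSD with that LV as its BlueStore DB
ceph-volume lvm create --bluestore --data /dev/sdc --block.db ceph-db-40gb/db-sdc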
===== Adding journal DB/WAL partition =====
If an OSD needs to be shut down for maintenance (e.g. adding a new disc), set ''ceph osd set noout'' first to prevent unnecessary data rebalancing.
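For example, around the maintenance window:
ceph osd set noout
# ... shut the node down, add the disc, boot it again ...
ceph osd unset noout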
==== Create partition on NVMe drive ====
Reorganize an existing NVMe/SSD disc to make some free space, then create an empty partition in the freed space.
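Once the space is freed (see the two approaches below), the new partition can be created, for example with ''sgdisk''; a minimal sketch, where the device name and size are only examples:
# print the current layout and free space
sgdisk -p /dev/nvme0n1
# create a 40G partition in the first free partition slot, starting at the first free block
sgdisk -n 0:0:+40G /dev/nvme0n1
# make the kernel re-read the partition table
partprobe /dev/nvme0n1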
=== Cut some space from the zpool cache NVMe partition ===
# remove cache partition from zpool
zpool list -v
zpool remove rpool /dev/nvme0n1p3
# ... reorganize the partition ...
# re-add the cache, using the partition UUID reported by blkid:
blkid
zpool add rpool cache 277455ae-1bfa-41f6-8b89-fd362d35515e
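Check that the cache device is attached again:
zpool status rpool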
=== Cut some space from zpool ===
Example of how to cut some space from the ''nvmpool'' zpool using a spare temporary drive:
* we have one spare HDD which will become a new Ceph OSD in the future
* a zpool doesn't support online shrinking
* move ''nvmpool'' to the spare HDD to release the NVMe ''nvmpool'' partition:
zpool replace nvmpool nvme0n1p4 sdc
# zpool status nvmpool
  pool: nvmpool
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu May 6 14:13:32 2021
        70.2G scanned at 249M/s, 21.0G issued at 74.4M/s, 70.2G total
        21.1G resilvered, 29.91% done, 00:11:17 to go
config:

        NAME             STATE     READ WRITE CKSUM
        nvmpool          ONLINE       0     0     0
          replacing-0    ONLINE       0     0     0
            nvme0n1p4    ONLINE       0     0     0
            sdc          ONLINE       0     0     0  (resilvering)
* wait for the resilver to finish
* reorganize the partitions
* replace the disks again:
zpool replace nvmpool sdc nvme0n1p4
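Wait for this second resilver to finish before wiping or reusing the spare HDD:
watch zpool status nvmpool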
==== Replace OSD ====
ceph osd tree
ceph device ls-by-host pve5
DEVICE                     DEV  DAEMONS  EXPECTED FAILURE
TOSHIBA_HDWD120_30HN40HAS  sdc  osd.2
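If it is unclear which physical device backs the OSD, the OSD metadata can be queried as well (a sketch; the exact field names may differ between Ceph releases):
# show host, device list and BlueStore block device path for osd.2
ceph osd metadata 2 | jq '{hostname, devices, bluestore_bdev_partition_path}'
# or, on the OSD host itself, list local OSDs with their LVs/devices
ceph-volume lvm list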
### Switch off the OSD. Ceph will rebalance the data from the remaining replicas when the OSD is taken out directly
ceph osd out X
## or, better, use the lines below:
# optional safety measure for small clusters, instead of using ceph osd out right away
ceph osd reweight osd.X 0
# wait for the data to migrate away from osd.X
watch 'ceph -s; ceph osd df tree'
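A simple way to wait for the migration is to poll ''safe-to-destroy'' in a loop (X is the OSD id):
while ! ceph osd safe-to-destroy osd.X; do sleep 60; done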
# Remove the OSD (ceph osd destroy keeps the OSD id marked as "destroyed", so the same id can be reused for the replacement)
ceph osd out X
ceph osd safe-to-destroy osd.X
ceph osd down X
systemctl stop ceph-osd@X.service
ceph osd destroy X --yes-i-really-mean-it
#pveceph osd destroy X
# remove the partition table, boot sector and any OSD leftovers:
ceph-volume lvm zap /dev/sdX --destroy
## it is not possible to point pveceph at an existing DB partition (but see the note at the beginning of this page):
# pveceph osd create /dev/sdc --db_dev /dev/nvme0n1p3
## it expects a whole device as the DB device, managed with LVM, and will create a new LV in the free space, i.e.
# pveceph osd create /dev/sdc --db_dev /dev/nvme0n1 --db_size 32G
## so the direct ceph command will be used instead:
# Prevent backfilling while the new OSD is being added
ceph osd set nobackfill
### Create OSD:
ceph-volume lvm create --osd-id X --bluestore --data /dev/sdc --block.db /dev/nvme0n1p3
# or split the above into two steps:
ceph-volume lvm prepare --bluestore --data /dev/sdX --block.db /dev/nvme0n1pX
ceph-volume lvm activate --bluestore X e56ecc53-826d-40b0-a647-xxxxxxxxxxxx
# also possible: ceph-volume lvm activate --all
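The OSD id and OSD fsid required by ''activate'' can be read from ''ceph-volume'' after the prepare step:
# list the freshly prepared OSD with its osd id and osd fsid
ceph-volume lvm list /dev/sdX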
## DRAFTS:
#ceph-volume lvm create --cluster-fsid 321bdc94-39a5-460a-834f-6e617fdd6c66 --data /dev/sdc --block.db /dev/nvme0n1p3
#ceph-volume lvm activate --bluestore
Verify:
ls -l /var/lib/ceph/osd/ceph-X/
lrwxrwxrwx 1 ceph ceph 93 Jan 28 17:59 block -> /dev/ceph-16a69325-6fb3-4d09-84ee-c053c01f410f/osd-block-e56ecc53-826d-40b0-a647-5f1a1fc8800e
lrwxrwxrwx 1 ceph ceph 14 Jan 28 17:59 block.db -> /dev/nvme0n1p3
ceph daemon osd.X perf dump | jq '.bluefs'
{
    "gift_bytes": 0,
    "reclaim_bytes": 0,
    "db_total_bytes": 42949664768,   --> 39.99 GB (40GB partition created)
    "db_used_bytes": 1452269568,     --> 1.35 GB
    "wal_total_bytes": 0,
    "wal_used_bytes": 0,
    ...
# OR
    "db_total_bytes": 4294959104,    --> 3.9 GB (4GB partition)
    "db_used_bytes": 66052096,       -->
ceph device ls
And restore backfilling:
ceph osd unset nobackfill
Check the benefits:
* observe lower latency on OSDs with an NVMe/SSD DB: ''watch ceph osd perf''
* check ''iotop'' output; ''[bstore_kv_sync]'' should now take less time.
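For example (refresh interval and iotop flags are only suggestions):
# per-OSD commit/apply latency, refreshed every 2 seconds
watch -n 2 ceph osd perf
# show only threads currently doing I/O; look for the ceph-osd [bstore_kv_sync] thread
iotop -o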