CEPH performance
block.db and block.wal
The DB stores BlueStore’s internal metadata and the WAL is BlueStore’s internal journal or write-ahead log. It is recommended to use a fast SSD or NVRAM for better performance.
Important: Since Ceph has to write all data to the journal (or WAL+DB) before it can ACK writes, keeping the metadata device and OSD performance in balance is really important!
For hosts with multiple HDDs (multiple OSDs), it is possible to use one SSD for the DB/WAL of all OSDs (one partition per OSD).
NOTE: The recommended mixed setup for one host is to use
- multiple HDDs (one OSD per HDD)
- one fast SSD/NVMe drive for DB/WAL (only about 30GB per 2TB HDD is needed).
The Proxmox UI and CLI expect a whole device as the DB device, not a partition! They will not destroy an existing drive: they expect an LVM setup with free space and will create a new LVM volume for the DB/WAL on it.
The native Ceph CLI can work with a partition specified as the DB device (it also works with a whole drive or an LVM volume).
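As a quick illustration of the difference (a sketch that reuses the device names from the examples further down this page; adjust them to your host):
# Proxmox tooling: a whole NVMe device with LVM free space; it creates the DB LV itself
pveceph osd create /dev/sdc --db_dev /dev/nvme0n1 --db_size 32G
# native Ceph tooling: an existing partition (or LV) can be given directly as block.db
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p3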
MORE INFO:
DB/WAL sizes
- If there is <1GB of fast storage, it is best to use it as WAL only, without a DB (see the sketch after this list).
- if a DB device is specified but an explicit WAL device is not, the WAL will be implicitly colocated with the DB on the faster device.
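For the WAL-only case mentioned above, a sketch (the device names are placeholders; --block.wal is ceph-volume's option for a separate WAL device):
# data on the HDD, only the WAL on the small fast partition, no separate DB
ceph-volume lvm create --bluestore --data /dev/sdc --block.wal /dev/nvme0n1p4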
DB size:
- (still true for Octopus 15.2.6) the DB should be 30GB, and this doesn't depend on the size of the data partition.
- all block.db sizes except 4, 30 and 286 GB are pointless (the useful values are tied to RocksDB level sizes),
- the DB logical volume should be as large as possible
- for RGW (Rados Gateway) workloads: min 4% of the block device size
- for RBD (Rados Block Device) workloads: 1-2% is enough (2% of 2TB is 40GB)
- according to ceph daemon osd.0 perf dump, 80GB was reserved on the HDD for the DB, of which only 1.6-2.4GB is used.
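To check how much of the DB a running OSD actually uses, a small sketch (it assumes jq is installed; replace osd.0 with your OSD id):
ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, wal_used_bytes}'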
Adding journal DB/WAL partition
Reorganize the existing NVMe/SSD disk to make some free space, and create an empty partition in that space.
# remove cache partition from zpool
zpool list -v
zpool remove rpool /dev/nvme0n1p3
... reorganize partition ...
blkid
zpool add rpool cache 277455ae-1bfa-41f6-8b89-fd362d35515e
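One possible way to create the new partition in the freed space (an untested sketch using sgdisk; the partition number 3 and the 32G size are assumptions, pick what matches your layout):
# create a new 32G partition (here number 3) in the free space and re-read the partition table
sgdisk --new=3:0:+32G /dev/nvme0n1
partprobe /dev/nvme0n1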
If an OSD needs to be shut down for maintenance (e.g. adding a new disk), set
ceph osd set noout
to prevent unnecessary data rebalancing.
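After the maintenance is finished, clear the flag again:
ceph osd unset noout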
Remove an OSD from the cluster
ceph osd tree
ceph device ls-by-host pve5
DEVICE                     DEV  DAEMONS  EXPECTED FAILURE
TOSHIBA_HDWD120_30HN40HAS  sdc  osd.2

### Switch OFF the OSD. Ceph should rebalance data from replicas when the OSD is switched off directly
ceph osd out osd.2
## or better use the lines below:
# this is optional, for safety on small clusters, instead of using "ceph osd out osd.2" directly
ceph osd reweight osd.2 0
# wait for data migration away from osd.2
watch 'ceph -s; ceph osd df tree'

# Remove the OSD
ceph osd out osd.2
ceph osd safe-to-destroy osd.2
systemctl stop ceph-osd@2.service
pveceph osd destroy 2
# to remove partition table, boot sector and any OSD leftover:
ceph-volume lvm zap /dev/sdc --destroy

## it is not possible to specify a DB partition with the pveceph command (see the beginning of this page):
# pveceph osd create /dev/sdc --db_dev /dev/nvme0n1p3
## it requires a whole device as db_dev with LVM and will create a new LVM volume on the free space, i.e.
# pveceph osd create /dev/sdc --db_dev /dev/nvme0n1 --db_size 32G
## so the direct ceph commands will be used:

#??? verify this step
#ceph osd create 2

# Prevent backfilling before the new OSD is added
ceph osd set nobackfill

### Create OSD:
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p3
# or split the above into two steps:
ceph-volume lvm prepare --bluestore --data /dev/sdc --block.db /dev/nvme0n1p3
ceph-volume lvm activate --bluestore 2 e56ecc53-826d-40b0-a647-5f1a1fc8800e
# also possible:
ceph-volume lvm activate --all

## DRAFTS:
#ceph-volume lvm create --bluestore --osd-id 2 --data /dev/sdc --block.db /dev/nvme0n1p3
#ceph-volume lvm create --cluster-fsid 321bdc94-39a5-460a-834f-6e617fdd6c66 --data /dev/sdc --block.db /dev/nvme0n1p3
#ceph-volume lvm activate --bluestore <osd id> <osd fsid>
Verify:
ls -l /var/lib/ceph/osd/ceph-2/
lrwxrwxrwx 1 ceph ceph 93 Jan 28 17:59 block -> /dev/ceph-16a69325-6fb3-4d09-84ee-c053c01f410f/osd-block-e56ecc53-826d-40b0-a647-5f1a1fc8800e
lrwxrwxrwx 1 ceph ceph 14 Jan 28 17:59 block.db -> /dev/nvme0n1p3

ceph daemon osd.2 perf dump | jq '.bluefs'
{
    "gift_bytes": 0,
    "reclaim_bytes": 0,
    "db_total_bytes": 42949664768,
    "db_used_bytes": 942661632,
    "wal_total_bytes": 0,
    "wal_used_bytes": 0,
    ...

ceph device ls
And restore backfilling:
ceph osd unset nobackfill
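Then watch the recovery/backfill progress until the cluster is healthy again (the same watch command as above):
watch 'ceph -s; ceph osd df tree'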
Check benefits:
- Observe better latency on the OSD with NVMe/SSD:
watch ceph osd perf
- check iotop output. Now [bstore_kv_sync] should take less time.
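A non-interactive way to capture this, as a sketch (it assumes iotop is installed and uses its standard batch options):
# three batch samples, only threads doing I/O; look for the bstore_kv_sync thread
iotop -b -o -n 3 | grep bstore_kv_sync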
Issues
auth: unable to find a keyring
It is not possible to create a Ceph OSD, either from the WebUI or from the command line:
pveceph osd create /dev/sdc
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-2/activate.monmap
 stderr: 2021-01-28T10:21:24.996+0100 7fd1a848f700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2021-01-28T10:21:24.996+0100 7fd1a848f700 -1 AuthRegistry(0x7fd1a0059030) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
- /etc/pve/ceph.conf
[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring
[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring
ceph.conf Variables
- $cluster - cluster name. For Proxmox it is ceph
- $type - daemon process type: mds, osd, mon
- $id - daemon or client identifier. For osd.0 it is 0
- $host - hostname where the process is running
- $name - expands to $type.$id, e.g. osd.2 or client.bootstrap
- $pid - expands to the daemon pid
With the [client] line above, a client named client.bootstrap-osd therefore looks for /etc/pve/priv/ceph.client.bootstrap-osd.keyring, which is exactly the path in the error message.
SOLUTION:
cp /var/lib/ceph/bootstrap-osd/ceph.keyring /etc/pve/priv/ceph.client.bootstrap-osd.keyring
An alternative to try: change ceph.conf.
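A possible shape for that alternative (an untested sketch relying on the variable expansion described above; it points the bootstrap-osd client at the keyring that already exists on disk):
[client.bootstrap-osd]
     keyring = /var/lib/ceph/bootstrap-osd/$cluster.keyring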
Unit -.mount is masked.
Running command: /usr/bin/systemctl start ceph-osd@2
 stderr: Failed to start ceph-osd@2.service: Unit -.mount is masked.
--> RuntimeError: command returned non-zero exit status: 1
It was caused by gparted, which wasn't shut down correctly.
Solution:
systemctl --runtime unmask -- -.mount
To list runtime masked units:
ls -l /var/run/systemd/system | grep mount | grep '/dev/null' | cut -d ' ' -f 11
To unescape systemd unit names:
systemd-escape -u 'rpool-data-basevol\x2d800\x2ddisk\x2d0.mount'
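To do both steps in one go, a small sketch (it reuses the listing command above but picks the unit name with awk):
# print the unescaped name of every runtime-masked mount unit
for u in $(ls -l /var/run/systemd/system | grep mount | grep '/dev/null' | awk '{print $9}'); do
    systemd-escape -u "$u"
done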