ceph daemon osd.$(ls -1 /var/lib/ceph/osd/ceph-?/../ | cut -d "-" -f 2) perf dump | jq ".bluefs | {db_total_bytes, db_used_bytes}"
1st machine:
{ "db_total_bytes": 3221217280, "db_used_bytes": 1162870784 }
2nd machine:
{ "db_total_bytes": 4294959104, "db_used_bytes": 3879731200 }
It looks like the DB is almost full on the 2nd machine, which is a bit strange because both machines use the same configuration.
A deeper look into BlueFS:
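Rather than running the check host by host, the same counters can be pulled for all OSDs from a single node. A minimal sketch, assuming jq and an admin keyring are available and that your release supports "ceph tell osd.N perf dump":

# report BlueFS DB fill level for every OSD (sketch)
for id in $(ceph osd ls); do
  ceph tell osd.$id perf dump 2>/dev/null \
    | jq -r --arg id "$id" '.bluefs | "osd.\($id): \(.db_used_bytes) / \(.db_total_bytes) bytes (\(.db_used_bytes * 100 / .db_total_bytes | floor)%)"'
done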
ceph daemon osd.6 bluefs stats      # via the admin socket on the OSD's host
ceph tell osd.\* bluefs stats       # or for all OSDs from any node
1st machine:
Usage matrix:
DEV/LEV   WAL       DB         SLOW      *      *      REAL       FILES
LOG       0 B       195 GiB    0 B       0 B    0 B    5.4 MiB    1
WAL       0 B       1013 MiB   0 B       0 B    0 B    1009 MiB   8
DB        0 B       731 MiB    0 B       0 B    0 B    717 MiB    24
SLOW      0 B       0 B        2.6 GiB   0 B    0 B    2.6 GiB    43
TOTALS    0 B       197 GiB    2.6 GiB   0 B    0 B    0 B        76
2nd machine:
Usage matrix:
DEV/LEV   WAL       DB         SLOW      *      *      REAL       FILES
LOG       0 B       180 GiB    15 GiB    0 B    0 B    13 MiB     1
WAL       0 B       733 MiB    0 B       0 B    0 B    731 MiB    5
DB        0 B       2.9 GiB    502 MiB   0 B    0 B    3.3 GiB    72
SLOW      0 B       0 B        0 B       0 B    0 B    0 B        0
TOTALS    0 B       183 GiB    15 GiB    0 B    0 B    0 B        78
Rows are the type of data to be placed, and columns show on which device that data actually lives. On the 2nd machine, 2.9 GiB of DB data sits on the DB device and another 502 MiB of DB data has spilled over onto the SLOW device!
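The spillover can also be spotted without reading the matrix by hand: the bluefs perf counters expose slow_used_bytes as well. A rough check across all OSDs, assuming that counter name is present in your release:

# flag OSDs whose BlueFS data has spilled onto the slow device (sketch)
for id in $(ceph osd ls); do
  slow=$(ceph tell osd.$id perf dump 2>/dev/null | jq '.bluefs.slow_used_bytes // 0')
  [ "$slow" -gt 0 ] && echo "osd.$id: $slow bytes of DB data on the slow device"
done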
Compacting helps a bit:
ceph tell osd.<osdid> compact
ceph tell osd.\* compact            # or for all osds
{ "db_total_bytes": 4294959104, "db_used_bytes": 2881486848 } Usage matrix: DEV/LEV WAL DB SLOW * * REAL FILES LOG 0 B 180 GiB 15 GiB 0 B 0 B 3.7 MiB 1 WAL 0 B 90 MiB 24 MiB 0 B 0 B 109 MiB 9 DB 0 B 2.6 GiB 319 MiB 0 B 0 B 2.9 GiB 55 SLOW 0 B 0 B 0 B 0 B 0 B 0 B 0 TOTALS 0 B 183 GiB 15 GiB 0 B 0 B 0 B 65
ceph osd set noout                  # keep data in place while the OSD is down
systemctl stop ceph-osd.target      # stops all OSDs on this host
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-6
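Once the fsck comes back clean, the OSDs can be brought back and the flag cleared; a small sketch assuming systemd-managed OSDs (the restart step is an assumption, not part of the original commands):

systemctl start ceph-osd.target     # assumption: systemd-managed OSDs
ceph osd unset noout                # allow normal out-marking and recovery again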