Ceph osd.x spilled over

Recently I encountered my first spillovers after upgrading to Ceph 18.2.

Among other places, the problem shows up in the output of ceph health detail:

BLUEFS_SPILLOVER: 6 OSD(s) experiencing BlueFS spillover
osd.1 spilled over 845 MiB metadata from 'db' device (25 GiB used of 50 GiB) to slow device
osd.2 spilled over 64 KiB metadata from 'db' device (21 GiB used of 50 GiB) to slow device
osd.5 spilled over 2.8 GiB metadata from 'db' device (22 GiB used of 50 GiB) to slow device
osd.6 spilled over 1.4 GiB metadata from 'db' device (23 GiB used of 50 GiB) to slow device
osd.7 spilled over 634 MiB metadata from 'db' device (24 GiB used of 50 GiB) to slow device
osd.8 spilled over 12 GiB metadata from 'db' device (15 GiB used of 50 GiB) to slow device
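
The listing above comes straight from the health check; to pull it up on your own cluster, run:

ceph health detail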

At first this did not look like a trivial problem, but after some googling I found https://docs.ceph.com/en/quincy/rados/operations/health-checks/, which outlines how to deal with the warning by expanding the DB partition.

Recovery procedure

First, make sure OSDs are not marked out (and their data rebalanced) while they are briefly down:

ceph osd set noout
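
If you want to confirm the flag really took effect, the flags line of ceph osd dump should now include noout:

ceph osd dump | grep flags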

Stop the first OSD that has spilled over (with cephadm, the xxxx in the unit name is the cluster fsid):

systemctl stop ceph-xxxx@osd.1.service
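
Before touching any volumes, it is worth confirming the daemon is actually down; on recent releases the OSD tree can be filtered by state, so something like this should now list osd.1:

ceph osd tree down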

Expand its DB volume:

lvresize -L+20G /dev/osd/1_db
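
lvresize prints the size change, but you can also double-check with lvs (assuming the volume group is named osd, as the device paths above suggest):

lvs -o lv_name,lv_size osd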

Expand its bdev so BlueFS picks up the newly added space:

cephadm shell --name osd.1 ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1

Wait a while; for me this command took anywhere from 10-20 seconds up to about 10 minutes. The output looked similar to:

Inferring fsid d37ba800-0e8d-11ed-868b-90b11c0a98d9
Using recent ceph image quay.io/ceph/ceph@sha256:bffa28055a8df508962148236bcc391ff3bbf271312b2e383c6aa086c086c82c
inferring bluefs devices from bluestore path
0 : device size 0x13ffff000 : using 0x4f200000(1.2 GiB)
1 : device size 0x117fffe000 : using 0x63b600000(25 GiB)
2 : device size 0x105effc00000 : using 0xacfc8fa0000(11 TiB)
Expanding DB/WAL...
0 : expanding  to 0x5368709120
0 : size label updated to 5368709120
1 : expanding  to 0x75161927680
1 : size label updated to 75161927680
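
If you want to verify the new layout before bringing the OSD back, ceph-bluestore-tool also has a read-only bluefs-bdev-sizes command that can be run the same way:

cephadm shell --name osd.1 ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1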

Start the OSD and wait for it to re-join the cluster:

systemctl start ceph-xxxx@osd.1.service
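
You can watch it come back up and the PGs settle with:

ceph -s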

Perform a manual compaction, which makes RocksDB rewrite its data onto the now larger DB device and should clear the spillover warning:

ceph tell osd.1 compact
{
    "elapsed_time": 167.97297091600001
}
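
Repeat the procedure for every OSD in the health warning. Once they are all done, clear the flag set at the beginning and check that the spillover warning is gone:

ceph osd unset noout
ceph health detail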

Notes

This problem might be entirely due to my setup: I have separate WAL and DB block devices for each of my OSDs, and I'm not sure whether it even applies to all-in-one OSDs (block + WAL + DB on a single disk), since in that case there is no separate DB device to spill from. I created my OSDs like this:

ceph orch daemon add osd zebox:data_devices=/dev/sdh,db_devices=/dev/osd/0_db,wal_devices=/dev/osd/0_wal
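
If you are unsure whether your own OSDs even have dedicated DB/WAL devices, ceph osd metadata shows how each OSD was deployed; the exact field names differ a bit between releases, so the grep below is only a rough filter:

ceph osd metadata 1 | grep -iE 'bluefs|devices'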