Deploying Ceph with storage tiering

Posted by Maxime on January 15, 2017 in ceph

You have several options to deploy storage tiering within Ceph. In this post I will show you a simple yet powerful approach to automatically update the CRUSHmap and create storage policies.

Some basics

Storage tiering means having several classes of storage with different performance characteristics. The classic three-tiered approach is:

  • fast: all flash
  • medium: spinning disks accelerated by flash journals
  • slow: archive disks with colocated journals

Tiered CRUSHmap

First we will configure the crush location hook. This is a script invoked every time an OSD starts to determine the OSD’s location in the CRUSHmap. To keep things simple I use the size of the OSD’s disk to decide which tier it belongs to:

  • Bigger than 6 TB → archive drive
  • Between 1.6 TB and 6 TB → disk with flash journal
  • Smaller than 1.6 TB → assumed to be an SSD

This mapping works in most environments, but you may want to adjust the thresholds to fit your hardware.

Append the following code to a copy of /usr/bin/ceph-crush-location, then point osd crush location hook in ceph.conf at that copy (a configuration example follows the script).

# more than 6 TB (in GB) for slow
size_limit_slow=6000
# more than 1.6 TB (in GB) for medium
size_limit_medium=1600

# Size of the OSD's data filesystem in GB (df reports 1K blocks)
size=$(df /var/lib/ceph/osd/ceph-$id | awk '{if(NR > 1){printf "%d", $2/1024/1024}}')
if [ "$size" -gt "$size_limit_slow" ]
then
  tier="slow"
elif [ "$size" -gt "$size_limit_medium" ]
then
  tier="medium"
else
  tier="fast"
fi
echo "host=$(hostname -s)-$tier root=$tier"

After a restart your OSDs will show up under a tier-specific root, and the OSD tree should look like this:

  • root fast
    • host ceph-1-fast
    • host ceph-2-fast
    • host ceph-3-fast
  • root medium
    • host ceph-1-medium
    • host ceph-2-medium
    • host ceph-3-medium
  • root slow
    • host ceph-1-slow
    • host ceph-2-slow
    • host ceph-3-slow
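
Once the OSDs have registered you can verify the layout at any time; the weights and IDs will of course depend on your cluster:

ceph osd tree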

Creating rulesets

Rulesets allow you to describe your storage policies. We will use them to restrict each storage pool to a single tier, which you can do by editing the CRUSHmap (the commands after the rules show how to inject the updated map). Below is an example of rulesets for replicated pools with copies stored on different hosts.

rule fast {
  ruleset 1
  type replicated
  min_size 1
  max_size 10
  step take fast
  step chooseleaf firstn 0 type host
  step emit
}
rule medium {
  ruleset 2
  type replicated
  min_size 1
  max_size 10
  step take medium
  step chooseleaf firstn 0 type host
  step emit
}
rule slow {
  ruleset 3
  type replicated
  min_size 1
  max_size 10
  step take slow
  step chooseleaf firstn 0 type host
  step emit
}
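
To apply these rules, a common workflow is to export the current CRUSHmap, decompile it, add the rules, then recompile and inject it back (the file names are just examples):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and add the rules above
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin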

Bring it all together

To finish, assign the appropriate ruleset to each storage pool as shown below and you are ready to go.

# Set fast tier for the rbd-fast pool
ceph osd pool set rbd-fast crush_ruleset 1
# Set medium tier for the rbd pool
ceph osd pool set rbd crush_ruleset 2
# Set slow tier for the "archives" pool
ceph osd pool set archives crush_ruleset 3
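
You can check which ruleset a pool is using; replace rbd-fast with whichever pool you want to inspect:

ceph osd pool get rbd-fast crush_ruleset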

Monitoring

If you are doing Ceph tiering in production, you quickly realize that the output of ceph status shows the combined available and used space of all tiers.

To find the used space of each storage tier, use ceph osd df tree. You can feed the per-tier numbers into your monitoring system with the following command:

# Show percentage used, used space and total size per tier
ceph@ceph-1:~$ sudo ceph osd df tree | grep 'root ' | awk '{print $10 ":", $7 "%" " " $5 "/" $4}'
fast: 50.11% 1169G/2332G
medium: 23.28% 3059G/13142G
slow: 10.19% 6153G/60383G

Edit

Since Ceph Luminous there is native support for this kind of storage tiering in the form of device classes.
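
For reference, here is a minimal sketch of the device class approach on Luminous or later; the rule and pool names are only examples:

# Create a replicated rule limited to SSD-class devices under the default root
ceph osd crush rule create-replicated fast-replicated default host ssd
# Assign it to a pool (crush_ruleset was renamed to crush_rule in Luminous)
ceph osd pool set rbd-fast crush_rule fast-replicated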