Heaviness of large ceph clusters

Data capacity is the first thing that comes to mind when talking about large ceph clusters, or data storage systems in general. The number of drives is another measure to think about. And sometimes maximum IOPS is something to look out for, especially when considering an all-flash / NVMe cluster. But heaviness? What does that even mean?

The first interpretation might be the literal weight of the server hardware, and having seen rack wheels bend under that load, it's definitely something to account for, but not in this post.

Evenly filling different size boxes

Ceph uses an algorithm called CRUSH to place data across the cluster. To be able to distribute that data evenly, CRUSH assigns a weight to each drive. The weight is a unitless quantity; it only matters relative to the other weights. So a drive with weight 1 will hold x amount of data, while another drive in the same cluster with weight 2 will hold 2x. Simple enough.
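
To make that concrete, here's a tiny sketch, not the actual CRUSH algorithm (which is pseudo-random placement), just an illustration of how the expected share of data follows the weight; the OSD names are made up:

# Illustration only: CRUSH places data pseudo-randomly, but the expected
# share each drive ends up with is proportional to its weight.
weights = {"osd.0": 1.0, "osd.1": 2.0, "osd.2": 1.0}

total = sum(weights.values())
for osd, weight in weights.items():
    print(f"{osd}: weight {weight} -> expected share {weight / total:.0%}")

# osd.0: weight 1.0 -> expected share 25%
# osd.1: weight 2.0 -> expected share 50%
# osd.2: weight 1.0 -> expected share 25%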

Since we almost always want to distribute data according to capacity, this weight has been auto-assigned from the capacity of the underlying drive for a few years now. So 6TB drives get a default crush weight of 6, and 10TB drives get 10. In practice it's 5.455 and 9.095, because … you know, sometimes we count in 1000s and sometimes in 1024s.
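
The conversion is just the drive's capacity in bytes divided by a tebibyte (2^40 bytes). A quick sketch with nominal capacities; real drives expose slightly different byte counts, which is why the exact weights you see in practice can differ a touch:

# Default CRUSH weight = drive capacity expressed in TiB.
# Drives are sold in powers of 1000, a TiB is 2**40, hence the odd numbers.
TiB = 2 ** 40

for capacity_tb in (6, 10, 12):
    capacity_bytes = capacity_tb * 10 ** 12
    print(f"{capacity_tb}TB drive -> default crush weight {capacity_bytes / TiB:.3f}")

# 6TB drive -> default crush weight 5.457
# 10TB drive -> default crush weight 9.095
# 12TB drive -> default crush weight 10.914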

Accounting for total size of boxes

The second part is that CRUSH takes this weight concept and aggregates it up the hardware hierarchy. So a node with ten 6TB drives will have a weight of 60 (54.55 after the TiB conversion), while a similar node filled with 10TB drives instead would weigh 100 (90.95).

And this weight keeps being summed up at the rack, row, pod, etc. levels if you have them. Whether you have those levels or not, what doesn't change is that every ceph cluster has a total weight. That's the heaviness(!) of a ceph cluster, and by default it's the total raw capacity of the cluster, in tebibytes.

For example, when we configure a ceph cluster with 10 storage nodes, each containing 24 12TB drives, the cluster has 2880TB of raw capacity, and if we deploy it using defaults, its root weight would be 2619.36. For a hypothetical large cluster with 10,000 10TB drives you'd have a 100PB cluster with a default root weight of 90950. Right? Unfortunately no, you can't build that one.
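
The same arithmetic, aggregated up the hierarchy; again a sketch with nominal capacities, which lands within rounding distance of the figures above:

# Weights aggregate bottom-up: drive -> node -> (rack, row, ...) -> root.
TiB = 2 ** 40

def drive_weight(capacity_tb):
    # Nominal default CRUSH weight for a drive sold as capacity_tb TB.
    return capacity_tb * 10 ** 12 / TiB

# 10 nodes, each with 24 x 12TB drives
node_weight = 24 * drive_weight(12)
print(f"node weight ~{node_weight:.2f}, root weight ~{10 * node_weight:.2f}")

# the hypothetical 10,000 x 10TB cluster
print(f"root weight ~{10_000 * drive_weight(10):.1f}")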

Limits to the scale

What would happen instead is that your deployment gets stuck at exactly 7205 OSDs, with every further OSD that tries to start failing with a generic message like:

insert_item unable to rebuild roots with classes:
(34) Numerical result out of range

This looks like some kind of overflow, but where? There are definitely people out there running clusters with more than 7205 drives, right? Probably yes, but I'd guess those clusters are anything but “by defaults”.

The thing is, weights in the current implementation of CRUSH have a hardcoded limit of 65535. This also means a ceph cluster can weigh at most 65535 in total; beyond that it is too heavy. And with default weights, a ceph cluster can be as large as 72PB (65535 TiB) and not a single PB more.
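
As far as I can tell, that limit comes from how CRUSH stores weights internally: 16.16 fixed-point values in 32-bit integers, where 1.0 is encoded as 0x10000, and the root bucket's weight, the sum of everything below it, has to fit in the same format. A back-of-the-envelope sketch; the encoding assumption and the error string just mimic the message above, this is not actual Ceph code:

# Assumed 16.16 fixed-point encoding: weight 1.0 -> 0x10000, so anything
# at or above 65536.0 no longer fits into 32 bits.
def to_fixed_point(weight):
    encoded = int(weight * 0x10000)
    if encoded > 0xFFFFFFFF:
        raise OverflowError("(34) Numerical result out of range")
    return encoded

for weight in (9.095, 65535.0, 65536.0):
    try:
        print(f"weight {weight} -> 0x{to_fixed_point(weight):08x}")
    except OverflowError as err:
        print(f"weight {weight} -> {err}")

# weight 9.095 -> 0x00091851
# weight 65535.0 -> 0xffff0000
# weight 65536.0 -> (34) Numerical result out of range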

Time to get off the defaults highway

Fortunately, this purely hypothetical, rather complex and probably even unheard-of heaviness limit has a simple remedy: setting the weights to something other than the capacity in TiB, say the capacity in units of 10TiB, or simply setting the initial weights to something more relatable. In the global section of ceph.conf:

[global]
osd crush initial weight = 1

Or, if your hypothetical deployment already got stuck at its 7205th same-sized drive, you can reweight your entire cluster with:

ceph osd crush reweight-subtree default 1

That way the next limit to hit would be your 65535th drive. So keep that in mind when trying to scale your cluster to 10x its original size! (Hint: don't.)
