Data capacity is the first thing that comes to mind when talking about large ceph clusters, or data storage systems in general. The number of drives is another measure to think about. And sometimes maximum IOPS is something to look out for, especially when considering an all-flash / NVMe cluster. But heaviness? What does that even mean?

The first interpretation might be the literal, physical weight of the server hardware, and having seen the wheels of some racks bend under that load, it’s definitely something to account for, but not in this post.

## Evenly filling different-sized boxes

Ceph uses an algorithm called CRUSH to place the data all over the cluster. To be able to distribute data evenly, this algorithm assigns a weight to each of the drives. This is a unitless quantity, meaning it only matters relative to the other weights in the cluster. So a drive with weight `1` will hold `x` amount of data, while another drive in the same cluster with weight `2` will hold `2x`. Simple enough.
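
If you want to poke at these weights on a live cluster, they show up in the `WEIGHT` column of the usual inspection commands (nothing cluster-specific assumed here):

```
# per-OSD crush weights, next to how full each drive actually is
ceph osd df

# the same weights laid out over the crush hierarchy
ceph osd tree
```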

Since we almost always want to distribute data according to capacity, this weight has been **auto-assigned** from the capacity of the underlying drives for a few years now. So if you have `6TB` drives, their default crush weight would be `6`, or if you have `10TB` drives it would be `10`. In practice, it would be `5.455` and `9.095`, because … you know, sometimes we use `1000s` and sometimes `1024s`.
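
If you want to double-check that conversion yourself, it’s just the nominal byte count divided by `2^40`; a quick sketch with `bc` (real drives rarely report exactly their nominal size, which is why these don’t match the figures above to the last digit):

```
echo "scale=4; 6 * 10^12 / 2^40" | bc      # 5.4569
echo "scale=4; 10 * 10^12 / 2^40" | bc     # 9.0949
```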

## Accounting for total size of boxes

The second part is that CRUSH takes this weight concept and aggregates it up the hardware hierarchy. So a node with ten `6TB` drives will have a weight of `60`, and if you put `10TB` drives into a similar node instead, its weight would be `100`.

And this weight can be summed up at the rack, row, pod, etc. levels if you have
them. Whether you have these levels or not, what doesn’t change is that all ceph
clusters have a total weight. That’s the **heaviness**(!) of a ceph cluster, and
by default it’s the total raw capacity of the cluster, in tebibytes.
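
On a live cluster you can see this aggregation directly in the tree view; a heavily abridged and purely illustrative sketch for one host with ten `6TB` drives (IDs and exact decimals made up):

```
ceph osd tree
# ID   CLASS  WEIGHT    TYPE NAME
# -1          54.54999  root default
# -3          54.54999      host node1
#  0    hdd    5.45499          osd.0
#  1    hdd    5.45499          osd.1
# ...
```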

For example, when we configure a ceph cluster with `10` storage nodes, each containing `24` `12TB` drives, this cluster would have `2880TB` of raw capacity, and if we deploy it using defaults, its root weight would be `2619.36`. For a hypothetical large cluster with `10,000` `10TB` drives you’ll have a `100PB` cluster with a default root weight of `90950`. Right? Unfortunately no, you can’t.
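
The arithmetic behind those two root weights, again with `bc` and nominal drive sizes (real drives report slightly different byte counts, hence the tiny mismatch with the figures above):

```
echo "scale=2; 10 * 24 * 12 * 10^12 / 2^40" | bc    # ~2619.34
echo "scale=2; 10000 * 10 * 10^12 / 2^40" | bc      # ~90949.47
```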

## Limits to the scale

What would happen instead is that your deployment gets stuck at exactly `7205` OSDs, with every OSD beyond that failing to start with a generic message like

```
insert_item unable to rebuild roots with classes:
(34) Numerical result out of range
```

This looks like some kind of overflow, but where? There are definitely people out there running clusters containing more than `7205` drives, right? Probably yes, but I guess those clusters are anything but “**by defaults**”.

The thing is, weights in the current implementation of CRUSH have a hardcoded limit of `65535`. This also means a ceph cluster can weigh a total of `65535` at most, and beyond that it is too heavy. And if using default weights, a ceph cluster can be as large as `72PB` and not a single `PB` more.
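
Both numbers check out with the same back-of-the-envelope math: at roughly `9.095` of weight per `10TB` drive, the `65535` ceiling runs out just past the 7205th OSD, and `65535` TiB is about `72PB`:

```
echo "65535 / 9.0949" | bc                    # 7205 (bc defaults to integer division)
echo "scale=2; 65535 * 2^40 / 10^15" | bc     # ~72.05 (PB)
```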

## Time to get off the defaults highway

Fortunately this purely hypothetical, rather complex and probably even unheard-of heaviness limit has a simple remedy: setting the weights to something other than the capacity in `TiBs`, like the capacity in `10TiB` units. Or, better yet, setting the initial weights to something more relatable. In the global section of ceph.conf:

```
[global]
osd crush initial weight = 1
```
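
If you manage settings through the monitors rather than ceph.conf, the same option (spelled with underscores there) should also be settable via the centralized config database; either way it only applies to OSDs created after the change:

```
ceph config set osd osd_crush_initial_weight 1
```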

Or if your hypothetical deployment got stuck at the `7205th` same-sized drive, you can reweight your entire cluster with:

```
ceph osd crush reweight-subtree default 1
```
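
Afterwards the root weight should simply equal the number of OSDs; a quick way to eyeball it:

```
# the first few lines show the root and its (new, much smaller) weight
ceph osd tree | head
```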

That way the next limit to hit would be your `65535th` drive. So keep that in mind while trying to scale your cluster to `10x` its original size! (Hint: **Don’t**)