Costs and benefits of erasure coding - In style

In the last post we left off at the point where we needed to improve recovery performance in our largish cluster, but didn’t really know what to expect or what to improve. So in this post we evaluate some alternatives and run some tests to see what happens when we lose a drive. But first, we read through the docs a bit.

We currently have 3 alternative methods for reducing the recovery load in Ceph: LRC, SHEC and CLAY. We’ll start with LRC. The Ceph documentation says:

if lrc is configured with k=8, m=4 and l=4, it will create an additional parity chunk for every four OSDs

If we have additional chunks, where do we store them? Along with our data of course. So when we try to store 8 bytes, we’ll have 8 bytes in k chunks, 4 bytes in m chunks and another 3 bytes in local chunks (one for every four of the 12 k+m chunks), for a total capacity cost of 15 bytes. 15 for 8 means a space factor of 1.875x, much higher than our target of 1.5x. Even in a more ambitious case like k=16, m=4, l=4 the factor is 25/16 = 1.5625x, still over 1.5x. That was a short evaluation on our side: largish clusters being largish, we didn’t really have that amount of spare capacity.
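The arithmetic above is easy to replay for other profiles. This is only a sketch of our own counting, not anything the LRC plugin exposes: the function name is made up, and it assumes one local parity chunk per group of l chunks, with k+m being a multiple of l.

```python
def lrc_space_factor(k: int, m: int, l: int) -> float:
    """Raw bytes stored per byte of user data, by the counting used in this post."""
    if (k + m) % l != 0:
        # Assumption: groups are only well defined when l divides k + m.
        raise ValueError("k + m must be a multiple of l")
    local_chunks = (k + m) // l  # one extra parity chunk per group of l chunks
    return (k + m + local_chunks) / k


for k, m, l in [(8, 4, 4), (16, 4, 4)]:
    print(f"k={k} m={m} l={l}: {lrc_space_factor(k, m, l):.4f}x")
```

Both profiles land above the 1.5x target, which is why LRC was off the table for us.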

Moving on to the SHEC docs, things are a bit vague at best and an unpleasant journey through the term jungle at worst.

Space efficiency is a ratio of data chunks to all ones in a object

That’s quite understandable. While we use a space factor like 1.5x they suggest a ratio like 8/12. Tomayto tomahto.

The third parameter of SHEC (=c) is a durability estimator, which approximates the number of OSDs that can be down without losing data.

This is where it gets a bit hairy. It’s both an “estimator” and also somehow an “approximation” of how durable the cluster will be. So what should I do when I want to tolerate 4 simultaneous failures? Set it to 6 and let someone approximate my intention to 4? Or set it to 3 so that it can estimate that what I wanted was 4, but I was just too shy to just say it? Nothing?

OK, how about the reduction in recovery load I’m after?

Describing calculation of recovery efficiency is beyond the scope of this document.

While the alternatives have their recovery numbers in their documents pretty clearly laid out, numbers for SHEC are somehow beyond the scope. And it feels like we won’t be able to understand this one bit, using first grade math or above. But we still push forward to find the right article with hopefully the right scope.

recovery efficiency, is defined as the inverse number of recovery overhead (recovery overhead means the ratio of read chunks to all data chunks during the recovery)

Well, that’s not very complicated now, is it? What’s all the mystery? In fact we used a similar, and a bit more sensible, number in our first grade math exercises in the previous post, under the name of recovery factors for drive and network. If, again, a ratio is what they’re after, it’s always 1 in RS, since RS reads k chunks to rebuild one. If it’s the inverse they need for some reason, it’s still 1. That’s easy. Let’s see how they formulate it:

…

It seems it was beyond the scope of that article as well. So we still have no idea, but at least this time they provided some pre-calculated numbers for us to get a glimpse of what to expect.

k	m	l	ml/k	Space efficiency	Durability (annual)	Recovery overhead (1 / 2 failures)
4	2	4	2	67%	1.44E-17	1.00/1.00
4	3	3	2.25	57%	1.60E-18	0.75/1.00
6	4	3	2	60%	3.46E-18	0.50/0.74
5	3	3	1.8	63%	1.22E-10	0.60/0.90
7	4	3	1.71	64%	1.65E-10	0.43/0.69

It seems like the recovery overhead for a single drive failure is, well, l/k. And the recovery efficiency they just defined would be k/l. But they don’t want to use that anymore, I guess “overhead” was enough after all. Assuming that’s all correct, the recovery factor in our terms is just l. Default RS with 7+4 will have a 7x recovery drive/network factor, but if we use SHEC 7+4+3 as in the last example we’d have 3x. Easy huh?
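To check that reading against the table, here is a small sketch. The l/k pattern is our own guess from the rows, not a formula from the SHEC documents; it only covers the single-failure column, and the function and key names are ours.

```python
def shec_single_failure(k: int, m: int, l: int) -> dict:
    """Numbers for one SHEC profile, following the patterns seen in the table."""
    return {
        "ml_over_k": m * l / k,            # the durability estimator column
        "space_efficiency": k / (k + m),   # data chunks over all chunks
        "recovery_overhead": l / k,        # guessed from the 1x-failure column
        "recovery_factor": l,              # chunks read per chunk rebuilt, in our terms
    }


for k, m, l in [(4, 2, 4), (4, 3, 3), (6, 4, 3), (5, 3, 3), (7, 4, 3)]:
    row = shec_single_failure(k, m, l)
    print(
        f"k={k} m={m} l={l}  ml/k={row['ml_over_k']:.2f}  "
        f"space={row['space_efficiency']:.1%}  "
        f"overhead={row['recovery_overhead']:.2f}  factor={row['recovery_factor']}x"
    )
```

The two-failure column does not follow from this, so it is left out on purpose.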

Back to our evaluation: no free lunch here. To improve our recovery by any amount we need to sacrifice space, durability or both. The space side is much like LRC, and we already said that we have no spare capacity. On the durability side, it will be veeery hard to explain that we can tolerate up to 1.71 simultaneous drive failures but can recover faster. And by hard, I mean impossible. This will immediately be rounded down to 1. And thrown out.

By the way, I totally understand the motivation to speak in terms of more informative numbers like annual durability, AFR or MTTDL, instead of a boring number of simultaneous failures to tolerate. Total durability is what really matters and even if we have an easily understandable round number we’d still have to calculate them anyway, but easily understandable round numbers like good-old 2 are really helpful to have a clear picture of what to expect.

Having wasted enough time on this, we move on to the last alternative: CLAY. And we immediately find what we’re looking for: a formula for the total amount of disk IO during recovery. With a bit more digging we find they also provide one for the network load. With no estimations, approximations or guesses to move around, we can calculate our recovery factors:

k	m	RS drive	RS network	CLAY drive	CLAY network
2	1	2	2	2.000	2.000
2	2	2	2	1.500	2.000
4	1	4	4	4.000	4.000
4	2	4	4	2.500	3.000
4	3	4	4	2.000	2.667
6	2	6	6	3.500	4.000
6	3	6	6	2.667	3.333
6	4	6	6	2.250	3.000
8	2	8	8	4.500	5.000
8	3	8	8	3.333	4.000
8	4	8	8	2.750	3.500
9	3	9	9	3.667	4.333
10	4	10	10	3.250	4.000
12	4	12	12	3.750	4.500
16	4	16	16	4.750	5.500
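For profiles that are not in the table, the rows above can be reproduced with two closed forms. A caveat: these are fitted to the table rather than quoted from the CLAY documentation, they assume the default number of helpers d = k+m-1, and the function names are ours.

```python
def rs_factor(k: int, m: int) -> float:
    """Plain RS reads and transfers k chunks to rebuild one."""
    return float(k)


def clay_drive_factor(k: int, m: int) -> float:
    """Chunks' worth of data read from disk per chunk rebuilt (fits the table)."""
    return (k + m - 1) / m


def clay_network_factor(k: int, m: int) -> float:
    """Chunks' worth of data sent over the network per chunk rebuilt (fits the table)."""
    return (k + 2 * m - 2) / m


for k, m in [(4, 2), (8, 4), (16, 4)]:
    print(
        f"{k}+{m}: RS {rs_factor(k, m):.0f}x, "
        f"CLAY drive {clay_drive_factor(k, m):.3f}x, "
        f"CLAY network {clay_network_factor(k, m):.3f}x"
    )
```

With m=1 both CLAY factors collapse to k, which matches the 2+1 and 4+1 rows where CLAY brings nothing over RS.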

Having 2x-3x savings without any loss in capacity or durability? And all that after a plugin change only? This is the free lunch we’re after. So we started testing. First, the baseline with defaults.

The test ran on a cluster with 9 nodes using the jerasure plugin with parameters 4+2. We’ve filled the cluster to a degree, because you know… it has to recover something. Then stopped all reads and writes, to minimize the errors in measurements, and then stopped an OSD to simulate a failure. These are the numbers we noted down:

Recovery speed : 2415 MiB/s
Total disk write throughput : 632 MiB/s
Total disk read throughput : 2489 MiB/s
Total network throughput : 2667 MiB/s

Recovery speed is the number reported by Ceph, and it’s actually the total size of the objects being recovered. Whenever we repair a single lost chunk of, say, 1 byte, Ceph says it has recovered an object of 4 bytes in total, because k=4. So these numbers are close enough to what we would expect from an algorithm that reads and transfers 4x the amount of data it actually recovers. The small differences can be explained by metadata on the disk side, packaging/messaging on the network side, and then some … maybe.
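To put a number on "close enough", here is a rough sketch that turns the reported recovery speed into the throughput we would expect from plain RS and compares it with the measurements. It assumes the reported speed is exactly k times the rate of rebuilt chunks; the helper names are ours.

```python
def rs_expected_throughput(reported_recovery: float, k: int) -> dict:
    """Expected MiB/s for RS, assuming reported speed = k * rebuilt-chunk rate."""
    chunk_rate = reported_recovery / k
    return {
        "write": chunk_rate,        # each rebuilt chunk is written once
        "read": k * chunk_rate,     # k chunks read per chunk rebuilt
        "network": k * chunk_rate,  # and the same amount transferred
    }


def overshoot(measured: float, expected: float) -> float:
    """Fraction by which the measurement exceeds the expectation."""
    return measured / expected - 1


measured = {"write": 632, "read": 2489, "network": 2667}  # jerasure 4+2 run
expected = rs_expected_throughput(2415, k=4)

for name in ("write", "read", "network"):
    print(
        f"{name:8s} expected {expected[name]:8.2f}  measured {measured[name]:5d}  "
        f"({overshoot(measured[name], expected[name]):+.1%})"
    )
```

Under that assumption the disk side is within a few percent, and the network side is the furthest off.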

Next test was done on the same cluster with the same parameters but using the clay plugin. We of course had to reset the pools, re-fill data etc, and here are the numbers:

Recovery speed : 2035 MiB/s
Total disk write throughput : 601 MiB/s
Total disk read throughput : 1481 MiB/s
Total network throughput : 1939 MiB/s

First of all, this backs up the drive and network numbers we’ve seen in the clay article. The results are pretty close to what we would expect from the formulas alone. The small difference in the network throughput can again be attributed to the packaging/messaging overhead.

The second thing we see is that clay, while reading and transferring a lot less data than jerasure/RS, is actually recovering about 16% slower (2035 vs. 2415 MiB/s). This may be because this is a totally unscientific test with only a single repetition: disks do not always work at the same speed, the network does not always operate with the same smoothness, the ambient temperature cannot be the same etc. Or it may be because clay is a more complicated algorithm, so it takes more time to run. If that’s the case, it means that even massive savings in the amount of data read/transferred may not always convert to faster recovery. There ain’t no such thing as a free lunch after all.
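Both observations can be replayed from the eight numbers we noted down. In this sketch we treat total disk write throughput as a stand-in for the rate of rebuilt chunks, which is our assumption and not something Ceph reports as such; the dictionary layout and function names are ours too.

```python
RUNS = {
    # MiB/s, as noted down during the two 4+2 tests
    "jerasure": {"recovery": 2415, "write": 632, "read": 2489, "network": 2667},
    "clay": {"recovery": 2035, "write": 601, "read": 1481, "network": 1939},
}


def factors_vs_write(run: dict) -> tuple:
    """Measured (drive, network) factors, using disk writes as the rebuilt-chunk rate."""
    return run["read"] / run["write"], run["network"] / run["write"]


def slowdown(baseline: dict, candidate: dict) -> float:
    """Fractional drop in reported recovery speed relative to the baseline."""
    return 1 - candidate["recovery"] / baseline["recovery"]


for name, run in RUNS.items():
    drive, network = factors_vs_write(run)
    print(f"{name:9s} drive {drive:.2f}x  network {network:.2f}x")

print(f"clay recovers {slowdown(RUNS['jerasure'], RUNS['clay']):.1%} slower")
```

For 4+2 the table predicts 4x/4x for RS and 2.5x/3x for CLAY, so the measured factors can be read against those directly.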

In our case, the production cluster needed to have parameters 8+4 and the networking side was a bit fuzzy, as in, we were able to saturate some links pretty easily. So we went ahead with clay, because as the number of data chunks grows, the reduction in network/disk factors is truly substantial. As we can see in the last table above, the savings are about 25-37% for a 4+2 system and 65-70% for 16+4. At that point it’s beyond just some savings though. It might just be the thing that makes that design a reality.
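The savings percentages come straight from the table rows. A minimal sketch, with the factors copied from the table by hand and a helper name of our own:

```python
# (RS factor, CLAY drive factor, CLAY network factor), copied from the table
TABLE_ROWS = {
    "4+2": (4, 2.500, 3.000),
    "8+4": (8, 2.750, 3.500),
    "16+4": (16, 4.750, 5.500),
}


def savings(rs_factor: float, clay_factor: float) -> float:
    """Fraction of recovery traffic saved by CLAY relative to RS."""
    return 1 - clay_factor / rs_factor


for profile, (rs, drive, network) in TABLE_ROWS.items():
    print(
        f"{profile:5s} drive savings {savings(rs, drive):.1%}, "
        f"network savings {savings(rs, network):.1%}"
    )
```

These are savings in data read and transferred according to the formulas, not in recovery time, which is exactly the gap our 4+2 test ran into.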

The third and final thing we learned, or rather remembered, is that it pays to test with the actual workload. Because the final answer is, as always, it depends.