Many factors go into determining the performance of a storage system, and a common question is: what will the performance of this storage system be? Ultimately, actual performance can only be determined by benchmarking with the actual workloads, but this isn’t always possible when planning to acquire a storage system. We need a way to model the storage system and predict its performance. Fundamental to this is predicting the performance of the underlying “disk” system itself.
In a parallel storage system with many tasks, the underlying “disk” storage system sees a random I/O workload. This is true even if all the tasks are handling sequential I/O! A common mistake in planning for a storage system is to assume that sequential disk performance is the only performance factor that matters. In reality, we need to consider the rate of I/O operations (IOPs) we can perform, as well as the latency of the storage system. These factors are all interrelated.
Modeling performance with spinning disks
With a spinning disk, the amount of time needed to perform a disk IOP is:
TIOP = TC + TS + TRL + b⁄R
where
TC = time needed for controller to decode request, etc.
TS = seek time, i.e., the time needed to move the disk head to the correct track.
TRL = rotational latency, i.e., the time needed for the correct sector to rotate under the disk head
b = number of bytes transferred
R = data transfer rate (bytes per second)
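As a rough sanity check, here is a minimal sketch of this formula in Python, using assumed, representative values for a generic 7,200 RPM nearline drive. The controller time, average seek time, and transfer rate below are illustrative assumptions, not measurements of any particular drive.

```python
# Rough, illustrative estimate of one 4 KiB random IOP on a spinning disk.
# All drive parameters are assumed values for a generic 7,200 RPM drive.

T_C  = 0.0005               # controller overhead: assume 0.5 ms
T_S  = 0.004                # average seek time: assume 4 ms
T_RL = (60 / 7200) / 2      # rotational latency: half a revolution at 7,200 RPM
b    = 4096                 # bytes transferred (4 KiB)
R    = 100 * 2**20          # data transfer rate: assume 100 MB/s

T_IOP = T_C + T_S + T_RL + b / R
print(f"T_IOP = {T_IOP * 1000:.2f} ms  (~{1 / T_IOP:.0f} IOPs/s)")
# -> roughly 8.7 ms per IOP, i.e. on the order of 100 IOPs/s
```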
The outer tracks of a spinning disk are longer than the inner tracks, yet an outer track takes the same amount of time to pass under the disk head as an inner track. Because the outer tracks are longer, they are formatted with additional sectors. Consequently, more data can be transferred in the time it takes an outer track to pass under the disk head. In other words, R is a function of the track number.
The TS and TRL values depend on where the last data transfer took place. It is sensible to approximate TRL as half the time it takes the disk to make one rotation. TS is more complicated: it is not a linear function of the distance the head must travel. Using an average seek time is not quite precise, but it is the best we can do.
Short stroking is a strategy of only using the outer tracks of a spinning drive. This has the effect of reducing the distance the disk head needs to move, reducing the seek time. Additionally, since only outer tracks are used, the data transfer speed is increased, shortening the transfer time.
Modeling performance with solid state drives
Solid State Drives (SSDs) have neither disk heads nor spinning disks. There is no seek time, and there is no rotational latency! But there is still some time needed for the controller to find the data, and there is still some time needed to transfer data. We can compute the time needed for an IOP as:
TIOP = TC + b⁄R
Astute readers will have noticed that this is the same as our earlier formula, assuming we set both TS and TRL to 0.
Performance characteristics of real disks
In general, it is not possible to get actual values for TC or TS. But in practice, it is not necessary. Since we will be using constant average values for TS and TRL, we simplify by defining the “overhead time”:
TO = TC + TS + TRL
Thus:
TIOP = TO + b⁄R
We have some real-world performance data for some spinning drives and SSDs:
Medium | IOPs/s | Data transfer rate (MB/s) |
---|---|---|
NL SAS drive, 100% 4k random IOPs | 75-100 | 100 |
10k SAS drive, 100% 4k random IOPs | 200 | 125 |
SSD, 100% 4k random reads | 20,000 | 300 |
SSD, 100% 4k random writes | 4,000 | 200 |
With a little algebra, we can determine TO.
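For example, here is a short sketch of that algebra in Python, solving TO = TIOP − b⁄R = 1⁄(IOPs/s) − b⁄R for the drives in the table above. It assumes 4 KiB transfers, takes the upper end of the NL SAS range, and interprets MB/s as 2^20 bytes per second.

```python
# Solve T_O = 1/(IOPs/s) - b/R for the drives in the table above.
# Assumes b = 4 KiB and treats MB/s as 2**20 bytes per second.

b = 4096  # bytes per 4k IOP

drives = {                      # (IOPs/s, transfer rate in MB/s)
    "NL SAS":    (100,   100),  # upper end of the 75-100 range
    "10k SAS":   (200,   125),
    "SSD read":  (20000, 300),
    "SSD write": (4000,  200),
}

for name, (iops, mb_s) in drives.items():
    R = mb_s * 2**20            # bytes per second
    T_O = 1 / iops - b / R      # overhead time per IOP, in seconds
    print(f"{name:9s}  T_O = {T_O * 1e6:8.1f} us")
# NL SAS overhead is ~10 ms per IOP; the SSD read overhead is ~37 us.
```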
One thing to notice is how small the IOPs/s actually is for a spinning disk. The overhead time dominates each IOP, so achieving high bandwidth requires the transfer sizes to be as large as possible.
Traditional RAID
A volume or LUN on a RAID group with parity is organized as consecutive fixed-length stripes. Each stripe consists of fixed-length segments or strips of consecutive bytes on each data disk. Parity is computed on the entire stripe, then written to the parity disks.
A read IOP will read from enough disk segments, in parallel, to satisfy the request. There is no penalty for reading from a RAID volume. However, note that we are now treating the entire volume almost as if it were a single disk: the IOPs/s for reading from a RAID volume is the same as if we read from a single disk.1
Writes are more complicated. Writing to a volume requires the entire stripe to first be read into a buffer. Then the portion of the buffer affected by the write IOP is modified. Then the parity of the stripe is recomputed, which would not be possible without the contents of the entire stripe. Finally, the entire stripe is written (in parallel), along with the parity.
Generally this involves 2 IOPs on each disk, so write IOPs/s on a RAID volume is half that of a single disk. A special case is when the write IOP is exactly one stripe: in this case, the initial read IOP may be skipped, since the entire buffer consists of the data being written. With Spectrum Scale, this special case happens when we align the file system block size with the stripe size.
Features such as compression defeat our ability to align the block size to the stripe size, meaning we cannot avoid the RAID write penalty.
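As a rough illustration of the write penalty, here is a small sketch comparing full-stripe-aligned writes with partial writes that incur the read-modify-write. The per-disk write IOPs/s values are taken from the drive table above, and the halving is the simple model just described, not a measurement of any particular array.

```python
# Effect of the RAID write penalty, per the simple model above.
# Per-disk write IOPs/s values are illustrative, taken from the drive table.

def raid_write_iops(disk_write_iops: float, full_stripe_aligned: bool) -> float:
    """Approximate write IOPs/s of a parity RAID volume.

    A partial-stripe write needs a read and then a write on each disk
    (2 IOPs), while a full-stripe-aligned write skips the initial read.
    """
    return disk_write_iops if full_stripe_aligned else disk_write_iops / 2

for name, disk_iops in [("NL SAS", 100), ("SSD (writes)", 4000)]:
    aligned = raid_write_iops(disk_iops, full_stripe_aligned=True)
    partial = raid_write_iops(disk_iops, full_stripe_aligned=False)
    print(f"{name}: ~{aligned:.0f} IOPs/s aligned, "
          f"~{partial:.0f} IOPs/s with read-modify-write")
```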
Some calculated performance numbers for traditional RAID
Using these observations and data, we can calculate the performance of traditional RAID groups for transferring entire stripes.
Data disks | Stripe (MB) | Disk transfer speed (MB/s) | Disk 4k IOPs/s | Segment size (MB) | IOP overhead (s) | LUN IOPs/s | LUN MB/s | Comments |
---|---|---|---|---|---|---|---|---|
4 | 1 | 100 | 100 | 0.25 | 0.009960938 | 80.25 | 80 | |
8 | 1 | 100 | 100 | 0.125 | 0.009960938 | 89.20 | 89 | |
4 | 2 | 100 | 100 | 0.5 | 0.009960938 | 66.84 | 134 | |
8 | 2 | 100 | 100 | 0.25 | 0.009960938 | 80.25 | 161 | |
4 | 4 | 100 | 100 | 1 | 0.009960938 | 50.10 | 200 | |
8 | 4 | 100 | 100 | 0.5 | 0.009960938 | 66.84 | 267 | |
4 | 8 | 100 | 100 | 2 | 0.009960938 | 33.38 | 267 | |
8 | 8 | 100 | 100 | 1 | 0.009960938 | 50.10 | 401 | |
4 | 1 | 300 | 20000 | 0.25 | 3.69792E-05 | 1149.01 | 1149 | SSD read |
4 | 1 | 200 | 4000 | 0.25 | 0.000230469 | 675.46 | 675 | SSD write |
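As a cross-check, here is a short sketch that reproduces these rows from the model above. It assumes that a full-stripe transfer costs each data disk one IOP of overhead time plus the time to move its segment, and that MB means 2^20 bytes, which matches the overhead column.

```python
# Reproduce the LUN numbers above: each full-stripe transfer costs every data
# disk T_O plus the time to transfer its segment, so the LUN behaves like one
# disk with a much larger "block" size.
MB = 2**20
b = 4 * 1024  # 4k IOP size in bytes

def lun_rates(data_disks, stripe_mb, speed_mb_s, disk_4k_iops):
    R = speed_mb_s * MB                     # per-disk transfer rate, bytes/s
    t_overhead = 1 / disk_4k_iops - b / R   # per-IOP overhead time, seconds
    segment = stripe_mb * MB / data_disks   # bytes per disk per stripe
    t_stripe = t_overhead + segment / R     # time for one full-stripe IOP
    lun_iops = 1 / t_stripe
    return lun_iops, lun_iops * stripe_mb   # (LUN IOPs/s, LUN MB/s)

print(lun_rates(4, 1, 100, 100))      # ~ (80.25,  80)   first row
print(lun_rates(8, 8, 100, 100))      # ~ (50.10, 401)   8 data disks, 8 MB stripe
print(lun_rates(4, 1, 300, 20000))    # ~ (1149, 1149)   SSD read row
```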
These data may make it enticing to always use 8 data disks (RAID6 8+PQ). However, one must also consider how many RAID groups pack into an entire RAID storage enclosure. It is often the case that using RAID6 4+PQ will take better advantage of the IOPs available in an enclosure than RAID6 8+PQ simply because there can be more such RAID groups — albeit at a loss of capacity.
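To make that trade-off concrete, here is a small sketch comparing how many RAID6 4+PQ and 8+PQ groups fit in a hypothetical 60-bay enclosure, using the 1 MB stripe LUN IOPs/s from the table. The enclosure size is an assumption for illustration only.

```python
# Hypothetical 60-bay enclosure: how do 4+PQ and 8+PQ RAID6 groups compare?
# LUN IOPs/s values are the 1 MB stripe rows from the table above.
BAYS = 60

configs = {                 # (data disks, parity disks, LUN IOPs/s per group)
    "RAID6 4+PQ": (4, 2, 80.25),
    "RAID6 8+PQ": (8, 2, 89.20),
}

for name, (data, parity, lun_iops) in configs.items():
    groups = BAYS // (data + parity)
    print(f"{name}: {groups} groups, "
          f"{groups * lun_iops:.0f} aggregate IOPs/s, "
          f"{groups * data} data disks of capacity")
# 4+PQ: 10 groups -> ~803 aggregate IOPs/s, but only 40 data disks
# 8+PQ:  6 groups -> ~535 aggregate IOPs/s, with 48 data disks
```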
- If we have multiple read requests to the same volume, where those requests can be satisfied by segments on different disks, a high-performance controller might be able to satisfy these requests in parallel. This might allow more IOPs/s from the RAID group than if we simplistically treat it as if it were a single disk.
All this assumes that we have some strong means to detect read errors, like T10-PI. Older disks rely on a weak sector checksum. Undetected read errors (UREs) escape this weak checksum. Without T10-PI or equivalent, the only option to ensure data integrity is to read the entire stripe, including the parity disks, and validate the data. If this is required, it is certainly the case that the IOPs/s for the RAID volume is the same as the IOPs/s of the slowest disk in the RAID group. ↩