Where to do it? Compression

Where in the software stack should we compress data?

My answer is: do compression as high in the storage stack as it works well.

The storage stack

In other words, if an application can do its own compression, let it compress. Compression relies on reducing redundancies in the data set – if something can be efficiently computed from other parts of the data set, then it is redundant. If an integer will always fall in a small range, then we only need enough bits to store that range. The application programmers have the greatest knowledge of redundancies and efficiencies in the data set, putting them in the best position to compress the data most effectively.
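As a minimal sketch of the small-range integer case, the Python below packs values known to lie in a fixed range into the fewest whole bits per value. The function names and the temperature readings are made up for illustration; a real application would pick its own encoding.

```python
def bits_needed(lo: int, hi: int) -> int:
    """Bits required to store any integer in the inclusive range [lo, hi]."""
    if hi < lo:
        raise ValueError("hi must be >= lo")
    return max(1, (hi - lo).bit_length())


def pack(values, lo: int, hi: int) -> bytes:
    """Pack each value as a fixed-width offset from lo, first value first."""
    width = bits_needed(lo, hi)
    acc = 0
    for v in values:
        if not lo <= v <= hi:
            raise ValueError(f"{v} is outside [{lo}, {hi}]")
        acc = (acc << width) | (v - lo)
    nbytes = (width * len(values) + 7) // 8
    return acc.to_bytes(nbytes, "big")


def unpack(data: bytes, count: int, lo: int, hi: int) -> list:
    """Reverse of pack(); the caller must remember count, lo and hi."""
    width = bits_needed(lo, hi)
    acc = int.from_bytes(data, "big")
    mask = (1 << width) - 1
    return [((acc >> (width * (count - 1 - i))) & mask) + lo
            for i in range(count)]


# Hypothetical sensor readings that the application knows stay in -40..60.
readings = [-40, 0, 21, 60, 15, -3, 59, 22]
packed = pack(readings, -40, 60)
```

Only the application knows that the range is -40..60, which is exactly the kind of knowledge the lower layers of the stack do not have.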

But many applications don’t compress data. Some file systems have compression capabilities built into them. In particular, Spectrum Scale compresses files under the control of its policy engine – so not every file is compressed. This is important because a file that is already compressed will rarely shrink if another compression algorithm is applied to it. Compressed archive files, JPEG files, MP3 files, etc. are already compressed and typically should not be compressed again, and Spectrum Scale can choose not to apply its own compression algorithm to those files. Additionally, Spectrum Scale has some purpose-built compression algorithms for genomics data that can be selected by matching file types.
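The claim that already-compressed data rarely shrinks is easy to check. The sketch below uses Python's zlib as a stand-in for any general-purpose compressor (it is not what Spectrum Scale uses internally) and a synthetic text sample built from a small, made-up vocabulary.

```python
import random
import zlib

# Synthetic sample: random words from a small vocabulary, so the text has
# plenty of redundancy without being a single repeated string.
rng = random.Random(0)
words = ["storage", "stack", "block", "file", "system",
         "policy", "data", "layer", "compress", "media"]
raw = " ".join(rng.choice(words) for _ in range(20000)).encode()

once = zlib.compress(raw)    # first pass removes most of the redundancy
twice = zlib.compress(once)  # second pass has almost nothing left to find

print(f"raw:   {len(raw)} bytes")
print(f"once:  {len(once)} bytes")
print(f"twice: {len(twice)} bytes")
```

The second pass spends CPU time and gains nothing worth having, which is why skipping archives, JPEGs and MP3s is a sensible policy.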

The drawback to compressing in the file system layer is that we may incorrectly identify which files should be compressed, leading to some files not being compressed that could have been, and other files being compressed a second time to no benefit. Usually metadata is not compressed at all at the file system layer – and Spectrum Scale never compresses hot data. (In Spectrum Scale terms, compression is under the control of file management policies, never file placement policies. Policies can be used to select files that should compress well, and only compress those files. Spectrum Scale offers a few different compression algorithms for different types of files, and the policies can choose the most appropriate algorithm for a particular class of data.)
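To make the selection logic concrete, here is a rough Python sketch of the kind of decision such a policy encodes. It is not Spectrum Scale policy syntax; the suffix lists, the 30-day "hot" threshold, and the algorithm labels "general" and "genomics" are all placeholders chosen for illustration.

```python
from pathlib import PurePath
from typing import Optional

# Illustrative suffix lists only; a real policy would be tuned to the site.
ALREADY_COMPRESSED = {".zip", ".gz", ".bz2", ".xz", ".jpg", ".jpeg", ".mp3"}
GENOMICS = {".fastq", ".fq", ".sam"}


def choose_algorithm(filename: str,
                     days_since_access: float,
                     hot_days: float = 30.0) -> Optional[str]:
    """Return a placeholder algorithm label, or None to leave the file alone."""
    if days_since_access < hot_days:
        return None                    # hot data: do not compress
    suffix = PurePath(filename).suffix.lower()
    if suffix in ALREADY_COMPRESSED:
        return None                    # a second compression would not help
    if suffix in GENOMICS:
        return "genomics"              # purpose-built algorithm class
    return "general"                   # general-purpose algorithm class
```

The weakness described above is visible here: a suffix is only a guess about the contents, so a compressed file with an unusual name would still be compressed again.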

Depending on the storage system, we may be able to compress at the block storage level of the storage stack. Spectrum Virtualize, for instance, can provide block storage to Spectrum Scale, compressing it before storing it on physical media that doesn’t have its own compression capability. This will compress both metadata and hot data. However, the block storage layer has absolutely no awareness of what is actually in the blocks of data it is managing! This can lead to attempts to compress blocks belonging to files that could have been identified in a higher layer as being poor candidates for compression. A more serious issue arises in that compression is usually done in conjunction with providing a thin-provisioning capability. The benefit of thin-provisioning is that we can usually store more in a block storage system than would be expected based on the physical storage capacity. However, a file system layered on the block storage system may not be aware of thin-provisioning and not be prepared for finding that no space is available in a block device that is “large enough”. Extreme caution is needed when thin-provisioning is used with file systems.
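The thin-provisioning hazard can be shown with a toy model. The ThinPool class below is hypothetical and tracks only two counters; it does not represent how Spectrum Virtualize or any real block storage system accounts for space.

```python
import errno


class ThinPool:
    """Toy thin-provisioned, compressing block device (illustrative only)."""

    def __init__(self, physical_bytes: int, advertised_bytes: int):
        self.physical = physical_bytes      # real capacity on the media
        self.advertised = advertised_bytes  # size the file system is shown
        self.logical_used = 0
        self.physical_used = 0

    def write(self, logical_bytes: int, compressed_bytes: int) -> None:
        """Store logical_bytes that occupy compressed_bytes after compression."""
        if self.logical_used + logical_bytes > self.advertised:
            raise OSError(errno.ENOSPC, "logical volume full")
        if self.physical_used + compressed_bytes > self.physical:
            # The file system still sees free blocks, but the pool has none.
            raise OSError(errno.ENOSPC, "thin pool out of physical space")
        self.logical_used += logical_bytes
        self.physical_used += compressed_bytes


pool = ThinPool(physical_bytes=100, advertised_bytes=200)
pool.write(100, 50)      # compresses 2:1, as the over-commit assumed
# pool.write(60, 60)     # incompressible data: would raise ENOSPC even
#                        # though only 100 of 200 advertised bytes are used
```

The second write fails at a point where the file system believes the device is half empty, which is the situation it may not be prepared for.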

Lastly, the physical media itself may have its own compression capabilities. Usually this is also offered in conjunction with a block storage system that offers thin-provisioning capabilities. This compression is done in hardware, and with respect to performance, this is the most efficient location to do compression, for data that does in fact compress. However, this layer has the least awareness of what can and can’t be compressed. The presence of compression capabilities in the data path does increase latency. As noted above, thin-provisioning must also be used with caution.

Besides knowledge of what can and can’t be compressed, there are two other reasons to put compression higher in the storage stack. First, storage performance will depend in part on how much data is written between storage layers. If 10 GiB of written data compresses in the file system layer to 5 GiB, that effectively doubles the performance of the file system (especially with a parallel file system, where compression is distributed over the application nodes, reducing its performance impact). If compression is done by the physical storage layer, there will be little performance benefit to compression.
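The arithmetic behind "effectively doubles" can be sketched as follows, assuming the link between layers is the bottleneck and ignoring the CPU cost of compression. The 1 GiB/s link speed is an arbitrary figure chosen for the example.

```python
GIB = 1024 ** 3


def transfer_seconds(bytes_on_wire: int, link_bytes_per_sec: float) -> float:
    """Time to move the given number of bytes across a fixed-speed link."""
    return bytes_on_wire / link_bytes_per_sec


def effective_throughput(logical_bytes: int, bytes_on_wire: int,
                         link_bytes_per_sec: float) -> float:
    """Application-visible bytes per second when less data crosses the link."""
    return logical_bytes / transfer_seconds(bytes_on_wire, link_bytes_per_sec)


link = 1 * GIB                                        # assumed link speed
uncompressed = transfer_seconds(10 * GIB, link)       # all 10 GiB cross the link
compressed = transfer_seconds(5 * GIB, link)          # only 5 GiB cross the link
speedup = effective_throughput(10 * GIB, 5 * GIB, link) / link
```

If the same compression happens in the physical media instead, all 10 GiB still cross every layer above it, so the time on the wire does not change.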

Second, there is the interaction of compression and encryption. With today’s encryption and compression algorithms, encrypted data does not compress well. The redundancies that compression algorithms depend on finding are hidden by the encryption algorithm – hiding them is generally seen as necessary to prevent an attacker from getting information from the ciphertext. As we’ll see, encryption can also happen in different layers of the storage stack – but if compression is happening low in the stack, the probability is greater that it will be attempting to compress encrypted data.
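A small experiment shows the effect. The cipher below is a toy built from SHA-256 in a counter-style construction so that the example needs only the standard library; it is not secure and is not what any storage product uses. It only serves to produce output that looks random, as real ciphertext does.

```python
import hashlib
import random
import zlib


def keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 of key plus counter. For illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the data."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


# Redundant synthetic plaintext built from a small, made-up vocabulary.
rng = random.Random(1)
vocab = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
plain = " ".join(rng.choice(vocab) for _ in range(10000)).encode()
cipher = toy_encrypt(b"example key", plain)

compress_then_encrypt = toy_encrypt(b"example key", zlib.compress(plain))
encrypt_then_compress = zlib.compress(cipher)

print(len(plain), len(compress_then_encrypt), len(encrypt_then_compress))
```

Compressing first keeps the space savings; compressing the ciphertext saves nothing, which is the risk when compression sits below the layer that encrypts.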
