Failure groups in Spectrum Scale

When GPFS was first developed, RAID storage was not commonplace. To protect data, a file system could be configured so that each piece of data had a replica written to a second location. Over time, this mechanism was extended so that data could also be written to a third location.

Functionally, there is no distinction between the first instance of data written and the optional additional copies. To simplify the discussion, we will refer to every instance of written data as a replica. So we can say that every piece of file data (and metadata) may be written as 1, 2, or 3 replicas, depending on configuration.

Failure domains

Copies of data do us no good if we could lose both the original and the copy. This could happen, for instance, if all replicas were written to different volumes in the same disk enclosure, and a hardware problem then disabled the entire enclosure. It would be especially frustrating if other disk enclosures were still working. What we need is a way to tell Spectrum Scale that certain volumes might fail together, so that multiple replicas of a piece of data are not written to the same set of volumes.

The way this is done is to assign volumes to failure groups. Conceptually, a failure group is a collection of volumes residing in a common failure domain, that is, a set of hardware that could fail as a group. A failure domain might be a disk enclosure, or all the disk enclosures in a single rack, or all the volumes in a single data center. Failure domains need to be chosen judiciously. If too small a domain is chosen, we risk losing all replicas in a failure we could otherwise have survived. On the other hand, fewer failure domains might reduce opportunities for parallelism, reducing performance.

Replicating over failure groups

Each file has an associated replication factor. A file system has a default replication factor, but the policy engine may be used to change the replication factor of individual files. Metadata is also replicated, and may have a different replication factor from file data.

Striping over failure groups

Files are written as sequences of file system blocks. Each block is “striped” over the volumes in the file system, in a round-robin fashion. Normally this means that consecutive blocks of a file are written to consecutive volumes in the file system. With replication, each block is written multiple times (as many times as the replication factor), and no two replicas of the same block are ever written to the same failure group. This means that the loss of a single failure group can make at most one replica of each block of the file unavailable, leaving the other replicas in other failure groups.
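The placement rule above can be sketched in a few lines of Python. This is only an illustrative model, not Spectrum Scale's actual allocator: the function name place_blocks, the disk names, and the way the starting failure group rotates from block to block are all assumptions made for the example.

```python
from itertools import cycle

def place_blocks(num_blocks, disks, replication):
    """Toy model of replica placement (hypothetical helper, not a GPFS API).

    disks maps a disk name to its failure group number.
    Returns one list of disk names per block, one entry per replica.
    """
    # Collect the disks that belong to each failure group.
    groups = {}
    for disk, fg in disks.items():
        groups.setdefault(fg, []).append(disk)
    if replication > len(groups):
        raise ValueError("need at least as many failure groups as replicas")

    group_order = sorted(groups)
    # Round-robin over the disks inside each failure group.
    next_disk = {fg: cycle(groups[fg]) for fg in group_order}

    placement = []
    for block in range(num_blocks):
        # Pick `replication` distinct failure groups, rotating the start
        # so that consecutive blocks begin in different groups.
        chosen = [group_order[(block + r) % len(group_order)]
                  for r in range(replication)]
        placement.append([next(next_disk[fg]) for fg in chosen])
    return placement

if __name__ == "__main__":
    disks = {"d1": 1, "d2": 2, "d3": 1, "d4": 2}
    for number, replicas in enumerate(place_blocks(4, disks, 2)):
        print(number, replicas)
```

Whatever the real allocation order is, the invariant the sketch preserves is the one that matters: the replicas of any one block always land in different failure groups.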

A special case is the failure group -1. A disk in failure group -1 is considered to be in a different failure group than any other disks, including all other disks in failure group -1. If a disk is added to a file system without the failure group being designated, it will be tagged as failure group -1.
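The special treatment of failure group -1 can be expressed as a small comparison rule. The sketch below is a model of the behavior just described, with hypothetical helper names; it is not code taken from Spectrum Scale.

```python
def same_failure_group(disk_a, disk_b, disks):
    """Return True if two different disks share a failure group.

    disks maps a disk name to its failure group number. A disk in
    failure group -1 never shares a group with any other disk.
    """
    if disk_a == disk_b:
        return True
    fg_a, fg_b = disks[disk_a], disks[disk_b]
    if fg_a == -1 or fg_b == -1:
        return False
    return fg_a == fg_b

def distinct_failure_domains(disks):
    """Count the failure domains, treating each -1 disk as its own domain."""
    domains = set()
    for name, fg in disks.items():
        domains.add(("solo", name) if fg == -1 else ("group", fg))
    return len(domains)

if __name__ == "__main__":
    disks = {"d1": 1, "d2": 1, "d3": -1, "d4": -1}
    print(same_failure_group("d1", "d2", disks))
    print(same_failure_group("d3", "d4", disks))
    print(distinct_failure_domains(disks))
```

In this model, two disks that are both in failure group -1 may hold replicas of the same block, which is rarely what you want if they actually share hardware.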

To summarize, Spectrum Scale ensures for any replicated file or piece of metadata:

  • Each file system block has the requisite number of replicas.
  • Each replica of a file system block is in a different failure group.

Failure groups instead of RAID

There are no checksums or other mechanisms used to validate that a replica of a file block is correct. Data integrity must start with the storage system itself! The storage system must be able to detect data corruption such as UREs (unrecoverable read errors) and report it as a failure to read the data. This will trigger Spectrum Scale to use a different replica.

Consequently, failure group replication is not a substitute for RAID!

However, there are checksums with metadata. If metadata is lost or corrupted, the entire file system may be lost.

Configuring failure groups

Each volume (“disk”) is assigned by the administrator to a failure group as it is added to a file system.

The file system must first be created with mmcrfs to support replication. The maximum replication factors for both data and metadata may not be changed later! Use the -M option to set the maximum replication factor for metadata, and the -R option to set the maximum replication factor for data:

mmcrfs fs1 -F fs1.stanza -m 2 -M 3 -r 2 -R 3 \
    -Q yes -A yes -i 4k -S relatime --filesetdf -k all \
    -T /scale/fs1

The default replication factors may be less than the maximum replication factors. In our example, we keep two replicas of both data (the -r option) and metadata (the -m option). The default settings can be changed later, up to the maximums.

When adding disks to a file system, be sure to include a failureGroup clause in the stanza, as we see in this fragment from a stanza file, fs1-new.stanza:

%nsd:
    nsd=d1
    device=/dev/dm-2
    servers=scale01,scale02
    failureGroup=1

%nsd:
    nsd=d2
    device=/dev/dm-4
    servers=scale02,scale01
    failureGroup=2

%nsd:
    nsd=d3
    device=/dev/dm-6
    servers=scale01,scale02
    failureGroup=1

%nsd:
    nsd=d4
    device=/dev/dm-7
    servers=scale02,scale01
    failureGroup=2
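Before handing a stanza file to Spectrum Scale, it can be worth checking that every NSD has a failureGroup clause. The following Python sketch does that for files laid out like the fragment above. It assumes one attribute per line and is not a full stanza parser; the function names are hypothetical, and the check only sees the disks in this one file, not disks already in the file system.

```python
from collections import Counter

def parse_nsd_stanzas(text):
    """Parse %nsd: stanzas into a list of dicts (one attribute per line)."""
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("%nsd:"):
            stanzas.append({})
            # Allow an attribute on the same line as the %nsd: marker.
            line = line[len("%nsd:"):].strip()
            if not line:
                continue
        if "=" in line and stanzas:
            key, value = line.split("=", 1)
            stanzas[-1][key.strip()] = value.strip()
    return stanzas

def check_failure_groups(stanzas, replication):
    """Report NSDs without a failure group and count disks per group."""
    missing = [s.get("nsd", "?") for s in stanzas if "failureGroup" not in s]
    counts = Counter(int(s["failureGroup"])
                     for s in stanzas if "failureGroup" in s)
    enough = len(counts) >= replication
    return missing, counts, enough

if __name__ == "__main__":
    with open("fs1-new.stanza") as handle:
        stanzas = parse_nsd_stanzas(handle.read())
    missing, counts, enough = check_failure_groups(stanzas, replication=2)
    print("NSDs without failureGroup:", missing)
    print("disks per failure group:", dict(counts))
    print("enough failure groups in this file:", enough)
```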

We can create the NSDs and add them to a file system using the normal Spectrum Scale commands:

mmcrnsd -F fs1-new.stanza
mmadddisk fs1 -F fs1-new.stanza

Failure groups vs Storage Pools

Sometimes people confuse failure groups with storage pools. Both are groups of related volumes. However, failure groups are used to distinguish between volumes in different failure domains. Storage pools are used to distinguish between volumes by storage device class.

The Spectrum Scale policy engine can change the storage pool in which a file is stored, and it can change the replication factor of a file. However, it cannot specify which failure groups should be used!

All file system blocks belonging to a single file are written to a single storage pool, even though they may be replicated over multiple failure groups.

Guidance on failure groups

Judicious use of replication enables updating file system components (NSD servers, disk firmware, etc.) while the file system remains active.

Whenever possible, replicate metadata, or at least set the maximum replication factor for metadata to 2 or 3. There is a small space penalty for choosing a larger maximum replication factor, but it is usually worth setting the maximum replication factor for data to at least 2, just to get the flexibility of being able to replicate precious data later.

A common misconception is that the number of failure groups in a file system must match the replication factor. This is not true, and usually the number of failure groups exceeds the replication factor.

You need more than two failure groups to ensure the file system can be mounted after the loss of (part of) one failure group! This is because of the need to maintain File System Descriptor (FSDesc) quorum. Both 2 and 4 failure groups may lead to problems with FSDesc quorum. Choose either 3 failure groups, or 5 or more.
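The quorum problem with two failure groups can be seen with simple arithmetic. The sketch below assumes that a strict majority of the descriptor copies must remain readable, and it takes the location of each copy as input. It does not model how Spectrum Scale decides where to place the copies, and the function name is hypothetical.

```python
def survives_loss_of_any_group(descriptor_groups):
    """Check majority quorum after losing any single failure group.

    descriptor_groups lists the failure group holding each descriptor
    copy, for example [1, 1, 2] for three copies over two groups.
    """
    total = len(descriptor_groups)
    needed = total // 2 + 1          # strict majority of all copies
    for lost in set(descriptor_groups):
        remaining = sum(1 for fg in descriptor_groups if fg != lost)
        if remaining < needed:
            return False
    return True

if __name__ == "__main__":
    # Three copies squeezed into two failure groups: one group holds two.
    print(survives_loss_of_any_group([1, 1, 2]))
    # Three copies, each in its own failure group.
    print(survives_loss_of_any_group([1, 2, 3]))
```

With three copies spread over only two failure groups, one group necessarily holds two of them, so losing that group leaves a single copy and no majority. With three failure groups, each copy can sit in its own group and any single loss still leaves two of three.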
