File System Descriptor Quorum in Spectrum Scale

All disks in Spectrum Scale have a few pieces of information at fixed positions:

  • Sector 1 is the “File System unique ID”. This will be matched in the file system descriptor to a Spectrum Scale disk name, and is written to the disk when it is added to the file system with mmcrfs, mmadddisk, or mmrpldisk.
  • Sector 2 is the NSD ID, which will be matched to a disk name in the Spectrum Scale configuration (stored in /var/mmfs/gen/mmsdrfs). It is written by mmcrnsd.
  • Sector 8 is the File System Descriptor (FSDesc).

The file system descriptor is a data structure describing the file system and its state, and is used by Spectrum Scale to determine where disks fit into the file system. Every disk has a replica, but … if there are more than three disks, only some of them (three or five, depending on the rules below) are guaranteed to be up to date!

We can use the mmlsdisk -L command to see which disks have a current FSDesc:

    # mmlsdisk scale.hub -L
    disk         driver   sector     failure holds    holds                                    storage
    name         type       size       group metadata data  status        availability disk id pool         remarks
    ------------ -------- ------ ----------- -------- ----- ------------- ------------ ------- ------------ ---------
    fs8ahub01    nsd         512           1 Yes      Yes   ready         up                 5 system        desc
    fs8bhub01    nsd         512           2 Yes      Yes   ready         up                 6 system        desc
    fs8chub01    nsd         512           3 Yes      Yes   ready         up                 7 system        desc
    fs8ahub02    nsd         512           1 Yes      Yes   ready         up                 8 system
    fs8bhub02    nsd         512           2 Yes      Yes   ready         up                 9 system
    fs8chub02    nsd         512           3 Yes      Yes   ready         up                10 system
    Number of quorum disks: 3
    Read quorum value:      2
    Write quorum value:     2
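
If we want to pick out the descriptor holders in a script, the listing can be filtered on the remarks column. The following is only a sketch: it assumes the column layout shown in the sample above (eleven whitespace-separated fields when a remark is present, single-word status values), and the function name is made up for illustration.

```python
# Shortened copy of the sample listing above, used as test input.
SAMPLE = """\
disk         driver   sector     failure holds    holds                                    storage
name         type       size       group metadata data  status        availability disk id pool         remarks
------------ -------- ------ ----------- -------- ----- ------------- ------------ ------- ------------ ---------
fs8ahub01    nsd         512           1 Yes      Yes   ready         up                 5 system        desc
fs8bhub01    nsd         512           2 Yes      Yes   ready         up                 6 system        desc
fs8chub01    nsd         512           3 Yes      Yes   ready         up                 7 system        desc
fs8ahub02    nsd         512           1 Yes      Yes   ready         up                 8 system
Number of quorum disks: 3
"""


def disks_with_fsdesc(mmlsdisk_output):
    """Return (disk name, failure group) for rows whose remarks include 'desc'.

    Assumption: rows look like the sample, so field 0 is the disk name,
    field 1 the driver type, field 3 the failure group, and anything from
    field 10 onward belongs to the remarks column.
    """
    holders = []
    for line in mmlsdisk_output.splitlines():
        fields = line.split()
        if len(fields) >= 11 and fields[1] == "nsd" and "desc" in fields[10:]:
            holders.append((fields[0], fields[3]))
    return holders


print(disks_with_fsdesc(SAMPLE))
```

The failure group is kept as a string on purpose, since this sketch makes no assumption about how it is formatted beyond what the sample shows.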

How many replicas are kept current?

The rules are a bit complicated, as they depend on both the number of disks and the number of failure groups:

  • If there are at least five failure groups, then five replicas are created, each in a different failure group.
  • Otherwise, if there are at least three disks, then three replicas are created.
    • If there are at least three failure groups, the replicas will be distributed so that each is in a different failure group.
  • Otherwise create a replica on each disk (there are only one or two disks).
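
The rules above can be written down as a small function. This is just a restatement of the list for illustration, not Spectrum Scale's own code; the function name and the input checks are our own.

```python
def fsdesc_replica_count(num_disks, num_failure_groups):
    """Number of FSDesc replicas kept current, following the rules above."""
    if num_disks < 1 or not 1 <= num_failure_groups <= num_disks:
        raise ValueError("need at least one disk and 1..num_disks failure groups")
    if num_failure_groups >= 5:
        return 5          # one replica in each of five failure groups
    if num_disks >= 3:
        return 3          # spread over failure groups when there are three or more
    return num_disks      # one or two disks: a replica on each


print(fsdesc_replica_count(6, 3))   # the mmlsdisk example: six disks, three failure groups
print(fsdesc_replica_count(4, 2))   # four disks in two failure groups
```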

Figure 1: File system descriptor replicas with two failure groups and four disks

When a file system is mounted, Spectrum Scale needs to find a majority (more than half) of the FSDesc replicas. This can be an issue when failure groups are being used to create a highly available file system. Generally, we expect that if we have two failure groups, we can tolerate the loss of one of the failure groups. However, with two failure groups (and three or more disks), one of those failure groups will necessarily hold two of the three replicas. Should that be the failure group we lose, the file system cannot be mounted.

Figure 1 shows two failure groups. Since it has more than three disks, but fewer than five failure groups, there will be three file system descriptor replicas kept current, as marked by the FSDesc on three of the disks. To mount the file system, at least two replicas must be available. If failure group 2 should be down, the file system can still be mounted. However, if failure group 1 is down, there will not be enough replicas available in failure group 2 to mount the file system.
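
The arithmetic for Figure 1 can be checked with a few lines of code. This is a minimal sketch of the majority rule only; the dictionary layout (failure group mapped to the number of current replicas it holds) and the function name are our own choices.

```python
def can_mount(replicas_by_failure_group, failed_groups):
    """True if more than half of the current FSDesc replicas are still reachable."""
    total = sum(replicas_by_failure_group.values())
    available = sum(
        count
        for group, count in replicas_by_failure_group.items()
        if group not in failed_groups
    )
    return available > total / 2


# Figure 1: failure group 1 holds two replicas, failure group 2 holds one.
FIGURE_1 = {1: 2, 2: 1}

print(can_mount(FIGURE_1, {2}))  # True: two of three replicas remain
print(can_mount(FIGURE_1, {1}))  # False: only one of three remains
```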

descOnly disks

A highly available cluster must have at least three failure groups. In many cases, we only need a third failure group to hold the file system descriptor, and it is only needed in an emergency. We don’t want disks in this failure group to hold either data or metadata. We can do this by adding these disks with the type descOnly. If there is more than one disk in this failure group, only one will be guaranteed to have a current file system descriptor.
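
To see why the third failure group matters, we can list which single failure group losses would break descriptor quorum for a given layout. Again this is only a sketch of the majority rule; the layouts and the function name are illustrative.

```python
def fatal_failure_groups(replicas_by_failure_group):
    """Failure groups whose loss alone leaves no majority of FSDesc replicas."""
    total = sum(replicas_by_failure_group.values())
    return sorted(
        group
        for group, count in replicas_by_failure_group.items()
        if (total - count) <= total / 2
    )


two_groups = {1: 2, 2: 1}          # two failure groups, three replicas
three_groups = {1: 1, 2: 1, 3: 1}  # third group holds only a descOnly disk

print(fatal_failure_groups(two_groups))    # [1]
print(fatal_failure_groups(three_groups))  # []
```

With one replica in each of three failure groups, any single failure group can be lost and two of the three replicas are still available.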

A descOnly disk should have at least 128 MiB. Only file system descriptors are written to these disks. However, designating a disk as descOnly does not itself force a file system descriptor to be written to it. File system descriptors can go on any disk, and descOnly disks are only guaranteed to be chosen when no other disks are available when following the rules for choosing replicas. (In other words, descOnly repels data and metadata, but does not itself attract current file system descriptor replicas.)

Storage pools are not considered when file system descriptor replica disks are chosen. If disks in different storage pools are logically in the same failure domain, they should be marked as belonging to the same failure group. Once there are more than three failure groups, we lose the ability to force placement of file system descriptors onto particular disks. (There is one exception: if we have exactly five failure groups, we know that each has a copy of a file system descriptor, and we can exploit that fact to force placement of file system descriptors.)

Cluster and file system descriptor quorum

Highly available clusters need to consider not just cluster quorum but also file system descriptor quorum. In particular, we want to avoid being surprised by having cluster quorum, but not being able to mount a file system because we cannot access sufficient file system descriptor replicas. A good way to ensure this is to make sure the quorum nodes are integral to the failure groups in some way. For example, if all disks are logically in two failure groups, each failure group could have a quorum node associated with it. A “tiebreaker” quorum node could include a small disk (or partition) for each file system in the cluster, marked as descOnly, belonging to a third failure group.

Tiebreaker disks vs file system descriptor replicas

The notion of file system descriptor quorum is often confused with Spectrum Scale’s tiebreaker disk quorum mode. Indeed, tiebreaker disks can also be marked as descOnly disks. However, tiebreaker disk quorum can only be used when every quorum node can access every tiebreaker disk. Generally, three tiebreaker disks are used, and they suffice for the entire cluster. Additionally, each file system in the cluster needs to satisfy the requirement of having a majority of its file system descriptor replicas available. While the file system descriptor replicas can be served only by a single “tiebreaker” quorum node, all tiebreaker disks need to be visible to all quorum nodes.

Working with file system descriptors

The mmfsctl command can be used to temporarily exclude disks from quorum considerations.

Determine which disks have up-to-date file system descriptors:

    mmlsdisk FSNAME -L

The mmadddisk command can add an NSD with a stanza for a descOnly disk (at least 128 MiB). A stanza would look like:

    %nsd: device=/dev/sda3
        nsd=Gpfs01a3
        servers=gpfs01
        usage=descOnly
        failureGroup=50

To force migration of an active replica from a disk:
1. Suspend the disk. This forces the migration to another candidate (assuming there is one).
2. Resume the disk. Unless it is the only candidate for a required replica, it will no longer have an active copy.
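
The two steps can be written out as a command sequence. The sketch below only builds and prints the commands rather than running them. It assumes that the suspend and resume steps are done with mmchdisk (the text above does not name the command), and it borrows the file system and disk names from the mmlsdisk example purely as placeholders.

```python
import shlex


def fsdesc_migration_commands(fs_name, disk_name):
    """Build (but do not run) the commands for moving a replica off a disk.

    Assumption: suspend/resume are issued with mmchdisk; the final mmlsdisk
    call is there to check where the 'desc' remark ended up.
    """
    return [
        ["mmchdisk", fs_name, "suspend", "-d", disk_name],
        ["mmchdisk", fs_name, "resume", "-d", disk_name],
        ["mmlsdisk", fs_name, "-L"],
    ]


commands = fsdesc_migration_commands("scale.hub", "fs8ahub01")
for argv in commands:
    print(shlex.join(argv))
```

Printing first and running by hand is deliberate here: suspending a disk affects a live file system, so it is worth reviewing each command before executing it.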

The mmfsctl FSNAME exclude command can be used to exclude disks from having an active file system descriptor, and may be used in emergencies where file system descriptor quorum can no longer be obtained. The mmfsctl FSNAME include command may be used to restore eligibility to disks to receive a replica.
