All disks in Spectrum Scale have a few pieces of information at fixed positions:
- Sector 1 holds the “File System unique ID”. This ID is matched against the file system descriptor to map the disk to a Spectrum Scale disk name, and is written to the disk when it is added to the file system.
- Sector 2 holds the NSD ID, which is matched to a disk name in the Spectrum Scale configuration (stored in /var/mmfs/gen/mmsdrfs). It is written when the NSD is created.
- Sector 8 holds the File System Descriptor (FSDesc).
The file system descriptor is a data structure describing the file system and its state, and is used by Spectrum Scale to determine where each disk fits into the file system. Every disk holds a replica, but if there are more than three disks, only a subset of the replicas is guaranteed to be up to date!
We can use the mmlsdisk -L command to see which disks have a current FSDesc:

```
# mmlsdisk scale.hub -L
disk         driver   sector     failure holds    holds                            storage
name         type       size       group metadata data  status        availability disk id pool         remarks
------------ -------- ------ ----------- -------- ----- ------------- ------------ ------- ------------ ---------
fs8ahub01    nsd         512           1 Yes      Yes   ready         up                 5 system       desc
fs8bhub01    nsd         512           2 Yes      Yes   ready         up                 6 system       desc
fs8chub01    nsd         512           3 Yes      Yes   ready         up                 7 system       desc
fs8ahub02    nsd         512           1 Yes      Yes   ready         up                 8 system
fs8bhub02    nsd         512           2 Yes      Yes   ready         up                 9 system
fs8chub02    nsd         512           3 Yes      Yes   ready         up                10 system
Number of quorum disks: 3
Read quorum value:      2
Write quorum value:     2
```
How many replicas are kept current?
The rules are a bit complicated, as the answer depends on both the number of disks and the number of failure groups:
- If there are at least five failure groups, five replicas are created, each in a different failure group.
- Otherwise, if there are at least three disks, three replicas are created. If there are also at least three failure groups, the replicas are distributed so that each is in a different failure group.
- Otherwise, a replica is created on each disk (there are only one or two disks).
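The rules above can be sketched in a few lines of Python (a hypothetical helper for reasoning about the rules, not part of Spectrum Scale):

```python
def fsdesc_replica_count(num_disks: int, num_failure_groups: int) -> int:
    """Number of file system descriptor replicas kept current,
    following the rules listed above (a sketch, not GPFS source)."""
    if num_failure_groups >= 5:
        # One replica in each of five failure groups.
        return 5
    if num_disks >= 3:
        # Three replicas, spread across failure groups when at
        # least three failure groups exist.
        return 3
    # Only one or two disks: a replica on each.
    return num_disks

# Six disks in three failure groups, as in the mmlsdisk output above:
print(fsdesc_replica_count(6, 3))  # -> 3
```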
When a file system is mounted, Spectrum Scale needs to find a majority (more than half) of the FSDesc replicas. This can be an issue when failure groups are being used to create a highly available file system. Generally, we expect that with two failure groups we can tolerate the loss of one of them. However, with two failure groups (and three or more disks), one of those failure groups will necessarily hold two of the three replicas. Should that be the failure group we lose, the file system cannot be mounted.
Figure 1 shows two failure groups. Since there are more than three disks but fewer than five failure groups, three file system descriptor replicas are kept current, as marked by the FSDesc on three of the disks. To mount the file system, at least two replicas must be available. If failure group 2 is down, the file system can still be mounted. However, if failure group 1 is down, there are not enough replicas available in failure group 2 to mount the file system.
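The majority requirement can be checked with a small sketch (hypothetical names, not a Spectrum Scale API). With three current replicas split 2+1 across two failure groups, only the loss of the smaller group is survivable:

```python
def can_mount(replicas_per_group: dict, lost_groups: set) -> bool:
    """True if a majority of current FSDesc replicas survives the
    loss of the given failure groups (sketch of the majority rule)."""
    total = sum(replicas_per_group.values())
    surviving = sum(count for group, count in replicas_per_group.items()
                    if group not in lost_groups)
    return surviving > total / 2

# Failure group 1 holds two of the three current replicas, group 2 one:
layout = {1: 2, 2: 1}
print(can_mount(layout, {2}))  # True: 2 of 3 replicas remain
print(can_mount(layout, {1}))  # False: only 1 of 3 remains
```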
A highly available cluster must have at least three failure groups. In many cases, we only need a third failure group to hold a file system descriptor replica, and it is only needed in an emergency. We don’t want disks in this failure group to hold either data or metadata. We can arrange this by adding these disks with the usage type descOnly. If there is more than one disk in this failure group, only one is guaranteed to have a current file system descriptor.
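With a descOnly disk in a third failure group, each of the three failure groups holds one current replica, so the loss of any single failure group still leaves a majority. A self-contained sketch (hypothetical layout, not Spectrum Scale code):

```python
# Failure group -> current FSDesc replicas; FG 3 is the descOnly disk.
layout = {1: 1, 2: 1, 3: 1}
total = sum(layout.values())

# For each single failure group lost, does a majority of replicas remain?
survivable = {lost: (total - layout[lost]) > total / 2 for lost in layout}
print(survivable)  # every value is True: any one failure group can fail
```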
A descOnly disk should be at least 128 MiB. Only file system descriptors are written to these disks. However, designating a disk as descOnly does not itself force a file system descriptor to be written to it. File system descriptors can go on any disk, and descOnly disks are only guaranteed to be chosen when no other disks are available under the rules for choosing replicas. (In other words, descOnly repels data and metadata, but does not itself attract current file system descriptor replicas.)
Storage pools are not considered when file system descriptor replica disks are chosen. If disks in different storage pools are logically in the same failure domain, they should be marked as belonging to the same failure group. Once there are more than three failure groups, we lose the ability to force the placement of descriptor replicas onto particular disks. (There is one exception: if we have exactly five failure groups, we know that each has a copy of the file system descriptor, and we can exploit that fact to force placement.)
Cluster and file system descriptor quorum
Highly available clusters need to consider not just cluster quorum but also file system descriptor quorum. In particular, we want to avoid the surprise of having cluster quorum but being unable to mount a file system because we cannot access sufficient file system descriptor replicas. A good way to ensure this is to make the quorum nodes integral to the failure groups in some way. For example, if all disks are logically in two failure groups, each failure group could have a quorum node associated with it. A “tiebreaker” quorum node could include a small disk (or partition) for each file system in the cluster, marked as descOnly and belonging to a third failure group.
Tiebreaker disks vs file system descriptor replicas
The notion of file system descriptor quorum is often confused with Spectrum Scale’s tiebreaker disk quorum mode. Indeed, tiebreaker disks can also be marked as descOnly disks. However, tiebreaker disk quorum can only be used when every quorum node can access every tiebreaker disk. Generally, three tiebreaker disks are used, and they suffice for the entire cluster. Additionally, each file system in the cluster still needs a majority of its file system descriptor replicas available. While file system descriptor replicas can be served by a single “tiebreaker” quorum node, all tiebreaker disks must be visible to all quorum nodes.
Working with file system descriptors
The mmfsctl command can be used to temporarily exclude disks from quorum considerations.
Determine which disks have up-to-date file system descriptors:

```
mmlsdisk FSNAME -L
```
The mmadddisk command can add an NSD with a stanza for a descOnly disk (at least 128 MiB). A stanza would look like:

```
%nsd: device=/dev/sda3
  nsd=Gpfs01a3
  servers=gpfs01
  usage=descOnly
  failureGroup=50
```
To force migration of an active replica from a disk:
1. Suspend the disk. This forces the migration to another candidate (assuming there is one).
2. Resume the disk. Unless it is the only candidate for a required replica, it will no longer have an active copy.
The mmfsctl FSNAME exclude command can be used to prevent disks from holding an active file system descriptor, and may be used in emergencies where file system descriptor quorum can no longer be obtained. The mmfsctl FSNAME include command may be used to restore a disk’s eligibility to receive a replica.