Spectrum Scale systems are organized into clusters. File systems and their underlying resources are owned and managed by clusters, and possibly exported to remote clusters. Within a cluster, certain management functions are needed. Because we would like the cluster to remain active even when systems or network links fail, we do not want to tie these management functions to individual systems. Rather, Spectrum Scale designates certain systems as quorum nodes, which elect a cluster manager from amongst themselves. The cluster manager in turn chooses, from the potentially larger pool of manager nodes, the systems that will perform the other functions, most notably the file system manager.
Disk leasing is the mechanism used to coordinate systems’ access to disks.
A node must hold a valid lease on a disk (NSD) in order to do any I/O on that disk. Nodes request disk leases, or renewals of existing leases, from the cluster manager, which coordinates and tracks disk leases for all nodes in the cluster. A disk lease is granted for a period of time, so if a lease expires without renewal, the cluster manager knows the disk is no longer in use.
Heartbeat messages are also used to verify that nodes remain alive — as well as that the cluster manager itself remains alive! If a node is marked as unresponsive, any disks that node was using may be leased to other nodes — but only after the leases held by the unresponsive node have expired! The reason for waiting is that the cluster manager cannot know whether the node is unresponsive because the Spectrum Scale instance has actually failed, or because the network link between the node and the cluster manager has failed — in the latter case, the node may continue using the disk until its lease expires.
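The lease bookkeeping described above can be modeled in a few lines of Python. This is a toy sketch for illustration only: the LeaseManager class and its method names are invented, and the real protocol involves considerably more state (heartbeats, pings, recovery).

```python
import time

LEASE_DURATION = 35.0  # seconds; matches the default failureDetectionTime


class LeaseManager:
    """Toy model of the cluster manager's disk-lease bookkeeping."""

    def __init__(self):
        self.leases = {}  # node_id -> lease expiry timestamp

    def request_lease(self, node_id, now=None):
        """Grant or renew a lease for node_id; returns the new expiry time."""
        now = time.monotonic() if now is None else now
        self.leases[node_id] = now + LEASE_DURATION
        return self.leases[node_id]

    def may_reassign_disks(self, node_id, now=None):
        """Disks used by a node may be handed to other nodes only after its
        lease expires, since an unreachable node might still be doing I/O
        until that point."""
        now = time.monotonic() if now is None else now
        expiry = self.leases.get(node_id)
        return expiry is None or now >= expiry
```

The key point the sketch captures is the wait: even after a node stops responding, its disks stay off-limits until the last lease it held has run out.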
Alternatively, if all disks support SCSI persistent reserve, it can be used to fence off disks from non-responsive nodes without waiting for leases to expire. This speeds recovery from node failures.
Why quorum is needed
Consider the scenario in Figure 1: Nodes 3 and 4 can’t communicate with Nodes 1 and 2. What should nodes 3 and 4 do?
Are Nodes 1 and 2 dead? If so, ideally Nodes 3 and 4 will keep working. Or is it simply that we can't communicate? In that case, perhaps Nodes 3 and 4 should stop, so we don't corrupt data. Or perhaps it is Nodes 1 and 2 that should stop, and Nodes 3 and 4 that should continue?
The situation we must not allow to happen is the so-called "split brain": the cluster is partitioned, but both partitions continue to access the disks. That would ultimately lead to data corruption, and must not be permitted to happen!
This boils down to: where should the cluster manager be running? Nodes that can communicate with the cluster manager may continue to take out and renew disk leases; nodes that cannot communicate with the cluster manager will have their leases expire and are “expelled” from the running cluster. The purpose of quorum is to elect a single cluster manager (so no “split brain”), maintaining consistency at the expense of some availability.
Consistency, Availability, and Partition tolerance
What we would really like is a system that has the following characteristics:
- Consistency: All nodes see the same, consistent view of the data.
- Availability: All nodes can access the data.
- Partition tolerance: The system can continue to operate even if the network is broken.
Unfortunately, Brewer’s theorem, also known as the CAP theorem, demonstrates that it is impossible for a distributed computing system to simultaneously guarantee all three characteristics!
All distributed systems will occasionally experience some partitioning, so we need partition tolerance. POSIX file systems, to meet the POSIX standards expected by applications, must meet some strict requirements on consistency. Quorum algorithms maintain this consistency and partition tolerance, at the cost of some availability (all but one partition must be expelled).
Cluster quorum in Spectrum Scale
Spectrum Scale has two different mechanisms for computing cluster quorum:
1. Node quorum
2. Tiebreaker disk quorum
Both require quorum nodes to be defined, one of which will be elected to be the cluster manager. Beyond that, the actual semantics are different — tiebreaker disk quorum is not merely node quorum with an additional test.
Quorum nodes attempt to communicate with each other to form a group. A group with a majority of the quorum nodes is quorate, and will elect the cluster manager from amongst its members.
Generally we use an odd number of quorum nodes, e.g. 3 or 5. It is recommended to have no more than 7 quorum nodes. A single-node "cluster" can have a single quorum node (itself).
The maximum number of quorum nodes is 8, if the modern CCR configuration mechanism is used. “Traditional” non-CCR clusters can have up to 128 quorum nodes!
The more quorum nodes there are in a cluster, the longer it takes to select a cluster manager. Hence, it usually makes sense to use fewer quorum nodes, typically 3.
Figure 3 demonstrates that more quorum nodes do not necessarily increase availability of the cluster. Neither partition has a majority of the quorum nodes, so neither side will elect a cluster manager. With only three quorum nodes, the cluster was able to form a quorum in one of the partitions.
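The majority test behind node quorum can be expressed as a tiny predicate (an illustrative sketch, not Spectrum Scale code):

```python
def is_quorate(group_size: int, total_quorum_nodes: int) -> bool:
    """A partition is quorate when it contains a strict majority of the
    cluster's quorum nodes."""
    return group_size > total_quorum_nodes / 2
```

Note that a strict majority is required: if a cluster with an even number of quorum nodes splits evenly, neither side is quorate, and no cluster manager can be elected on either side.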
Tiebreaker disk quorum
In some cases, node quorum will not be sufficient. The most common case is a cluster with only two nodes (or only two nodes suitable for being quorum nodes) — if either node should fail, the cluster will go down. But if both quorum nodes are attached to the same storage, _tiebreaker disks_ may be used to choose a cluster manager.
As shown in Figure 4, tiebreaker disk quorum may be used with more than two quorum nodes as well. However, the semantics are different than for node quorum:
- Tiebreaker disks are 1-3 specially designated NSDs. In all other respects, they behave like other NSDs, and may also be used to serve data and metadata.
- All quorum nodes must be directly- or SAN-connected to all the tiebreaker disks. Hence, tiebreaker disk quorum is not suitable for shared-nothing clusters.
- A candidate node for being the cluster manager must be able to access a majority of the tiebreaker disks.
- A candidate must be able to connect to at least minQuorumNodes quorum nodes (including itself). The default for minQuorumNodes is 1.
- There is a maximum of 8 quorum nodes when tiebreaker disks are used.
- Elections require disk leases on tiebreaker disks to expire. SCSI persistent reserve can make quorum election faster.
If an election is triggered, the node that is currently the cluster manager will remain the cluster manager if it remains a candidate.
The cluster shown in Figure 4 would follow these rules to choose one of the partitions. Assuming the node serving as the cluster manager was in the top partition, that node would remain as the cluster manager, and that partition would form the quorate group.
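The eligibility rules above can be sketched as a small predicate. This is a simplified model for illustration, not the actual election code; the function name and signature are invented.

```python
def can_be_cluster_manager(tiebreaker_disks_reachable: int,
                           total_tiebreaker_disks: int,
                           quorum_nodes_reachable: int,
                           min_quorum_nodes: int = 1) -> bool:
    """Simplified eligibility test under tiebreaker disk quorum: a candidate
    must reach a majority of the tiebreaker disks AND at least
    minQuorumNodes quorum nodes (counting itself)."""
    return (tiebreaker_disks_reachable > total_tiebreaker_disks / 2
            and quorum_nodes_reachable >= min_quorum_nodes)
```

For example, with three tiebreaker disks and the default minQuorumNodes of 1, an isolated quorum node that can still reach two of the three disks remains a candidate; one that can reach only a single disk does not.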
Tiebreaker disks are sometimes confused with a second kind of quorum called file system descriptor quorum, but the two are different. Tiebreaker disks must be connected to all quorum nodes, are an alternative to node quorum, and are used to maintain cluster quorum. File system descriptor quorum pertains to each file system in the cluster, and becomes a design consideration when there are multiple failure groups.
Considerations in choosing quorum nodes
Because it is so vital to maintain quorum in a cluster, quorum nodes need to be chosen carefully. Some of the considerations are:
- Choose the smallest number of nodes that will meet all the necessary redundancy constraints.
- In general, use tiebreaker disk quorum only when the cluster is small (such as a two-node cluster). A cluster small enough to be using SAN-attached storage is a candidate for tiebreaker disk quorum.
- Choose reliable nodes. The quorum function is lightweight and can be bundled with other lightweight services.
- Network traffic to the quorum nodes is light, but it is critical.
- Try to isolate quorum nodes from common node failures.
While the quorum function is light, the CCR configuration mechanism found on quorum nodes has the vital function of saving configuration and state changes to files on each quorum node's local disks. So, it is best that quorum nodes not be doing heavy I/O to their local disks, at least not through the same data paths or I/O adapter as the /var/mmfs directory. Clusters have experienced trouble when CCR changes are not written to disk quickly enough. This use of a common I/O adapter might be a bit hidden: for instance, an ESS I/O node uses the same I/O adapter for accessing local disks as it uses to access its NVMe "logtip" storage.
There are a few configuration variables (set with mmchconfig) that can be used to tune how quorum works:
tiebreakerDisks: This is a semicolon-delimited list of NSD names (use 1 or 3). These become the tiebreaker disks. To revert to node quorum, set this variable to no.
minQuorumNodes: When tiebreaker quorum is used, this sets the minimum number of quorum nodes that must be in a partition containing the cluster manager (default is 1).
There is also a callback, tiebreakerCheck, that may be used in conjunction with tiebreaker disk quorum. It can invoke tests to determine whether or not a node should be eligible to be the cluster manager.
The CCR mechanism has some timeouts ranging from 45 seconds to 2 minutes. These are not tunable, but extremely poor networks may cause trouble if these limits are reached.
Tuning disk leasing
The disk leasing mechanism can also be tuned. Some of the configuration variables are:
failureDetectionTime: The number of seconds it takes to detect that a node is down (default 35 seconds, the same duration as a disk lease).
leaseRecoveryWait: When a node fails, wait until its known leases expire, then wait this many additional seconds before starting recovery (default 35 seconds). The intent is to give "in-flight" I/O time to get through the controllers and on to the disks.
usePersistentReserve: Enables SCSI persistent reserve for disk fencing. Check the documentation for guidance before enabling this.
maxMissedPingTimeout: Sets the range in which the calculated "missed ping timeout" (MPT) may fall (default range 3 to 60 seconds).
- The calculated MPT is clamped to this range.
- After a lease expires, the cluster manager pings the node and waits up to MPT seconds for a response. If none arrives, the node is expelled.
totalPingTimeout: Nodes that respond to ICMP pings but do not send heartbeats will be declared failed after this timeout (default 120 seconds).
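As a rough illustration of how these settings interact, the delay before recovery of a failed node can begin is approximately the lease duration plus leaseRecoveryWait. This is a back-of-envelope sketch under the assumption that the node renewed its lease just before failing; the real sequence also includes the ping checks described above.

```python
def approx_recovery_delay(failure_detection_time: float = 35.0,
                          lease_recovery_wait: float = 35.0) -> float:
    """Rough estimate of the seconds between a node failing and recovery
    starting: its lease (assumed renewed just before the failure) must
    expire, then leaseRecoveryWait elapses so in-flight I/O can drain.
    Simplified; the actual timing also involves the MPT ping checks."""
    return failure_detection_time + lease_recovery_wait
```

With the defaults, that is roughly 70 seconds, which is why SCSI persistent reserve (which fences disks without waiting for leases to expire) can speed up recovery so noticeably.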