Saturday, August 7, 2010

Understanding How Cluster Quorums Work

Based on conferences that I have attended and E-mails that I receive, it always seems to me that when it comes to clustering, quorums are one of the most commonly misunderstood topics. In order to effectively administer a cluster though, you need to understand what a quorum is and you need to know about the various types of quorums. In this article, I will explain what a quorum is and what it does. Since this tends to be a confusing topic for a lot of people, I will attempt to keep my explanations as simple as I can.

Clustering Basics

Before I can really talk about what a quorum is and what it does, you need to know a little bit about how a cluster works. Microsoft server products support two main types of clustering: server clusters and network load balancing (NLB). The design philosophies behind these two types of clusters couldn't be more different, but the one thing that both designs share is the concept of a virtual server.

There are several different meanings to the term virtual server, but in clustering it has a specific meaning. It means that users (and other computers) see the cluster as a single machine even though it is made up of multiple servers. The single machine that the users see is the virtual server. The physical servers that make up the virtual server are known as cluster nodes.

Network Load Balancing

These two types of clusters have completely different purposes. Network Load Balancing is known as a share-all cluster. It gets this name because an application can run across all of the cluster's nodes simultaneously. In this type of cluster, each server runs its own individual copy of an application. Each server can still link to a shared database, though.

Network Load Balancing clusters are most often used for hosting high-demand Web sites. In a network load balancing architecture, each of the cluster's nodes maintains its own copy of the Web site. If one of the nodes goes down, the other nodes in the cluster pick up the slack. If performance starts to dwindle as demand increases, just add servers to the cluster and those servers will share the workload. A Network Load Balancing cluster distributes the current workload across all of the cluster's active nodes. Users access the virtual server defined by the cluster, and each request is serviced by one of the active nodes.

Server Clusters

The other type of cluster is simply known as a server cluster. A server cluster uses what is known as a share-nothing architecture. This type of cluster is appropriate for applications that cannot be distributed across multiple servers. For example, you couldn't run a database server across multiple nodes because each node would receive updates independently, and the databases would not be synchronized.

In a server cluster, only one node is active at a time. The other node or nodes are placed in a sort of standby mode, waiting to take over if the active node should fail.

As you may recall, I said that server clusters are used for applications that cannot be distributed across multiple nodes. The reason that a node can take over running an application when the active node fails is that all of the nodes in the cluster are connected to a shared storage mechanism. This shared storage mechanism might be a RAID array, it might be a storage area network, or it might be something else. The actual media type is irrelevant, but the concept of shared storage is extremely important in understanding what a quorum is. In fact, server clusters are the only type of cluster that uses quorums. Network load balancing does not use quorums. Therefore, the remainder of this discussion will focus on server clusters.

What is a Quorum?

OK, now that I have given you all of the necessary background information, let's move on to the big question. What is a quorum? To put it simply, a quorum is the cluster's configuration database. The database resides in a file named \MSCS\quolog.log. The quorum is sometimes also referred to as the quorum log.

Although the quorum is just a configuration database, it has two very important jobs. First of all, it tells the cluster which node should be active. Think about it for a minute. In order for a cluster to work, all of the nodes have to cooperate so that the virtual server behaves in the desired manner. In order for this to happen, each node must have a crystal clear understanding of its role within the cluster. This is where the quorum comes into play. The quorum tells the cluster which node is currently active and which node or nodes are in standby.

It is extremely important for nodes to conform to the status defined by the quorum. It is so important, in fact, that Microsoft has designed the clustering service so that if a node cannot read the quorum, that node will not be brought online as a part of the cluster.

The other thing that the quorum does is intervene when communications fail between nodes. Normally, each node within a cluster can communicate with every other node in the cluster over a dedicated network connection. If this network connection were to fail, though, the cluster would be split into two pieces, each containing one or more functional nodes that cannot communicate with the nodes on the other side of the communications failure.

When this type of communications failure occurs, the cluster is said to have been partitioned. The problem is that both partitions have the same goal: to keep the application running. The application can't be run on multiple servers simultaneously though, so there must be a way of determining which partition gets to run the application. This is where the quorum comes in. The partition that "owns" the quorum is allowed to continue running the application. The other partition is removed from the cluster.
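To make the arbitration idea concrete, here is a minimal Python sketch. It is only an illustration and not how the cluster service is actually implemented: the function name surviving_partition and the node names are hypothetical, and "owning" the quorum is modeled simply as the partition that contains the node currently holding the quorum resource.

```python
def surviving_partition(partitions, quorum_owner):
    """Return the partition that is allowed to keep running the application.

    partitions   -- list of sets of node names, one set per partition
    quorum_owner -- name of the node that currently owns the quorum
    Both arguments are hypothetical inputs used only for this illustration.
    """
    for partition in partitions:
        if quorum_owner in partition:
            return partition
    # No partition can reach the quorum, so no partition may run the application.
    return None


# A four-node cluster split in two by a network failure.
partitions = [{"node1", "node2"}, {"node3", "node4"}]
winner = surviving_partition(partitions, quorum_owner="node3")
print(sorted(winner))
```

In this sketch the partition containing node3 keeps running the application, and the nodes in the other partition would be dropped from the cluster.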

Types of Quorums

So far in this article, I have been describing a quorum type known as a standard quorum. The main idea behind a standard quorum is that it is a configuration database for the cluster and is stored on a shared hard disk, accessible to all of the cluster's nodes.

In Windows Server 2003, Microsoft introduced a new type of quorum called the Majority Node Set (MNS) quorum. The thing that really sets an MNS quorum apart from a standard quorum is the fact that each node has its own, locally stored copy of the quorum database.

At first, each node having its own copy of the quorum database might not seem like a big deal, but it really is because it opens the door to long-distance clustering. Standard clusters are not usually practical over long distances because of the difficulty of accessing a central quorum database efficiently. However, when each node has its own copy of the database, geographically dispersed clusters become much more practical.

Although MNS quorums offer some interesting possibilities, they also have some serious limitations that you need to be aware of. The key to understanding MNS is to know that everything works based on majorities. One example of this is that when the quorum database is updated, each copy of the database needs to be updated. The update isn't considered to have actually been made until over half of the databases have been updated, that is, (number of nodes / 2) + 1, with the division rounded down. For example, if a cluster has five nodes, then three nodes would be considered the majority. If an update to the quorum was being made, the update would not be considered valid until three nodes had been updated. If only two or fewer nodes had been updated, then the majority of the nodes would still have the old quorum information and therefore the old quorum configuration would still be in effect.
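The majority rule is easy to express in code. The following Python sketch is only an illustration of the arithmetic described above; the function names majority and update_committed are my own and are not part of any Microsoft tool or API.

```python
def majority(node_count):
    """(number of nodes / 2) + 1, with the division rounded down."""
    return node_count // 2 + 1


def update_committed(node_count, nodes_updated):
    """An update counts only once a majority of the copies hold it."""
    return nodes_updated >= majority(node_count)


print(majority(5))             # 3
print(update_committed(5, 2))  # False: the old configuration is still in effect
print(update_committed(5, 3))  # True: a majority of the copies has the update
```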

The other way that an MNS quorum depends on majorities is in starting the nodes. A majority of the nodes, (number of nodes / 2) + 1 with the division rounded down, must be online before the cluster will start the virtual server. If fewer than the majority of nodes are online, then the cluster is said to "not have quorum". In such a case, the necessary services will keep restarting until a sufficient number of nodes are present.

One of the most important things to know about MNS is that you must have at least three nodes in the cluster. Remember that a majority of nodes must be running at all times. If a cluster only has two nodes, then the majority is calculated to be 2, since (2 nodes / 2) + 1 = 2. Therefore, if one node were to fail, the entire cluster would go down because it would not have quorum.
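A quick way to see why two nodes are not enough is to work out how many node failures each cluster size can survive. The Python sketch below is just that arithmetic; the helper names majority and tolerated_failures are hypothetical and chosen for this example.

```python
def majority(node_count):
    """(number of nodes / 2) + 1, with the division rounded down."""
    return node_count // 2 + 1


def tolerated_failures(node_count):
    """How many nodes can fail while a majority is still online."""
    return node_count - majority(node_count)


for nodes in range(2, 6):
    print(nodes, "nodes: majority", majority(nodes),
          "tolerates", tolerated_failures(nodes), "failure(s)")
```

A two-node cluster tolerates zero failures, while a three-node cluster tolerates one, which is why three nodes is the practical minimum for MNS.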

Conclusion

In this article I have explained the differences between a network load balancing cluster and a server cluster. I then went on to describe the roles that the quorum plays in a server cluster. Finally, I went on to discuss the differences between a standard quorum and a majority node set quorum.
