How Galaxy Handles Failures
Galaxy internals, part II
Last time I described how Galaxy, our open-source in-memory data grid, employs a cache-coherence protocol, similar to the ones used to coordinate CPU L1 caches, in order to enforce data consistency in the cluster. One big difference between common hardware protocols and the one used by Galaxy was that Galaxy tries to minimize broadcasting messages by keeping track of all sharers of every item, and by remembering an item’s most recent owner. This is necessary because the number of nodes in the cluster, unlike in most CPUs, can be quite large, and pestering all of them with broadcast messages can seriously hurt performance. Actually, some many-core hardware systems employ a similar technique (called the directory-based approach), only they have to deal with the problem of maintaining the directory in the memory-constrained environment of the L1 cache, while Galaxy doesn’t have this problem.
However, the biggest – and most interesting – difference between the hardware implementations and Galaxy is fault tolerance; hardware doesn’t have any. If a core or the CPU bus fail – the CPU fails, and that’s that. But in a cluster, the story is different. Galaxy, like any robust distributed system, must handle network and node failures. How it does that is the subject of this post.
As explained last time, most Galaxy messages are grouped into request/response pairs. Requests are sent as a result of some API call, and if a response is not received within some known, configurable, duration, the call throws a timeout exception. Timeouts can be thrown by all Galaxy operations, and must be handled by the application (usually, simply by retrying the operation, possible after some delay). Timeouts do not necessarily imply a software or hardware failure. Most commonly they are the result of a deadlock, which could happen when two or more nodes pin more than one item at the same time. In that case, a timeout will cause a transaction on one node to back-off while a conflicting one retries.
A common problem in distributed systems with timeouts is that there is no way of knowing whether the remote operation failed, or perhaps succeeded but the response has been delayed. Galaxy does not have this problem due to the nature of its remote operations: none of them are destructive. In fact, none of them have any effect on the data at all. Remember, all writes (data modifications) requested by a node are carried out locally on the node once it has received ownership of the relevant items. All remote requests do is transfer item ownership or invalidate shared copies. Actually, invalidation requests never time-out; they are retried indefinitely because they can never be involved in a deadlock, and the only reason they could fail is as a result of a node failure, which will eventually be detected and handled as explained later. So the only interesting request which might time-out is GETX (a GET can time-out as well, but that scenario is not that interesting).
If node A wants ownership of an item currently owned by node B, it will send B a GETX message. B will then mark the item as owned by A and respond with a PUTX. What happens if the response is delayed for any reason? Node A’s operation will eventually time-out, and the node will continue to believe that node B is the item’s owner. Node B, however, will think A now owns the item, so it will respond to any additional requests for the item from, say, node C, with a CHNGD_OWNR message. Assuming that message is received by C, it will then try to retrieve the item from A, which will, in turn, respond with yet another CHNGD_OWNR message claiming that B is the owner. If C’s operation does not time-out it will bounce from node B to A and vice-versa, sending its requests to one and then to the other, but eventually (unless a node failure occurs), the PUTX message will arrive at node A (node B will keep re-trying), which will then assume ownership and respond correctly to requests.
Galaxy employs third-party software for cluster membership management, shared configuration and failure detection. In particular, it uses either JGroups or Apache ZooKeeper for the job. JGroups and ZooKeeper detect node liveness or failure by means of heartbeat messages recorded at some central location, so they provide a definitive (or, at least, consistent) registry of live or dead nodes.
When they detect a failing node, it is removed from the cluster. It is Galaxy’s, role, however, to ensure that the data stored on the dead node is not lost. This is achieved by redundancy. All data items owned by a node (and are therefore written only by that node) are replicated to slave nodes (each replicating a single master node) and/or to a central server that persists all data in the cluster to secondary storage (disk). Once a node fails, its owned data is served by its slave or by the server. JGroups/ZooKeeper are also used to inform all nodes of the current slave/master status of all other nodes, so that if a slave becomes a master upon its master’s failure, all nodes know which node is the new master.
We must watch out for one thing. As we’ll see shortly, the data served by the server or the slaves following a node failure, while always consistent, may not be as up-to-date as the latest item versions on the failed node. This, and the fact that failing to do so would wreak havoc on item ownership, we must ensure that a node does not respond to any requests if other nodes believe it is dead (messages passed between nodes completely bypass JGroups/ZooKeeper). There is no efficient and 100% way to guarantee that, but we can ensure at a high probability that this does not happen if we set the node’s connection timeout to ZooKeeper/JGroups is smaller than the timeout required for them to detect a node’s failure. This way a node will disconnect from ZooKeepr/JGroups before it is known to be dead, and in that case it will shut itself down.
Before I delve into the particulars of slave nodes and the central server, I’d like to describe the general operation of data replication. After each write operation, an item’s latest version is kept in a backup packet, which is periodically sent to the node’s slaves and/or to the central server. This asynchronous replication enables a very high write rate, as writes do not have to wait for backups to complete. However, because writes complete before replication is acknowledged, and, in fact, not every item version is replicated at all – remember, backups are only done periodically – some durability is sacrificed for the sake of reducing latency. A node failure could mean losing, say, all updates made in the last 100ms – if that is the backup period we’ve configured.
But whether or not durability is maintained, we must preserve consistency at all costs, and asynchronous, periodic replication can jeopardize that. If node A has just finished writing version 100 of item X, which is then shared (via a GET message answered by a PUT) by node B, and then node A fails having only backed up version, say, 95 to its slave, future requests for the item (from the slave-turned-master) will yield version 95, but version 100 has already been read, and possibly used by node B to produce some other item’s value, and consistency is irreparably broken.
So, just as we do when waiting for INVACKs, all local operations on the item are allowed to proceed regardless of replication, but once the item is requested by another node, we flush the backup packet and wait until it is acknowledged before responding to the request. (As long as an item’s latest version has not been replicated, it is flagged as modified. The modified flag can be set while the item is either in E state, or even in the O state if a write has been carried out before all sharers have INVACKed, and this is why Galaxy’s cache-coherence protocol does not have a separate M (modified) state as those used by CPU L1 caches.) Even if a node suddenly dies, then, though its last few writes may be lost, data consistency is always maintained.
This general principle is behind several Galaxy optimizations: we allow write operations some slack until an external observer is involved. Because a well-behaved Galaxy application requires inter-node data transfers relatively rarely compared to intra-node operations, such optimizations should result in significant latency gains.
Each node may have zero or more slave nodes replicating its owned items. Only the owned items are replicated – not the shared ones, so when a node dies, the slave selected to replace it as master will not have any shared items. Therefore, when the other nodes detect its death, they will remove it from all of their relevant items’ sharers list. It does not matter if they perform this bookkeeping late, as all INV requests sent to the new master will automatically result in an INVACK response if the shared item is not found (this is Galaxy’s general behavior). Also, because slaves are not informed of items’ sharers, when a slave assumes the master’s role, it assigns all items the E state, so when node death is detected, all of its items’ sharers will invalidate their copies.
The logic outlined so far suffices if a node may have only one slave. If it has more than one, we must watch out for another possible scenario. Say node A has two slaves A1 and A2, and say that the latest backup packet sent from A to its slaves contains, perhaps among other items, version 100 of item X. A1 receives the packet, but then node A fails before A2 receives it. If A1 replaces A as the master, it has no way of knowing that A2 has an older version of X, and if A1 then fails without ever writing X and replicating it to A2, A2 would become the master with an old and – much more importantly – inconsistent value for X (as version 100 may have been used to compute other values). If, on the other hand A2 becomes the master, serves, say, version 95 of X, and then fails, the new master, A1, would now have an inconsistent value for X yet again.
One way of solving this is putting in place a consensus protocol to ensure all slaves agree on all values, but that would be very inefficient. A simpler solution is to perform leader-election among all of A’s slaves when it fails, with each slave advertising the latest packet it has received, electing the most up-to-date slave as the new master, and then pushing the missing updates to the other slaves.
This solution is not yet implemented in Galaxy, and that is why the current version allows a single slave per node (though you can manually start a new slave once the old one assumes the role of the master).
Optionally, and regardless of whether or not slaves are used for availability, it is possible (and recommended) to configure the Galaxy grid to use a special node called, simply, the server. The server is to the peer nodes (as the regular nodes are called) not unlike what main memory (RAM) is to the L1 caches – it holds all the data items in the cluster in a large, though relatively slow, disk storage (a Galaxy server can currently be configured to use either BerkeleyDB or any SQL database as its disk-persistence engine). When a node fails, if it has no slaves, the server can then serve its owned items. In addition, the server’s persistent data-storage can be a desired feature in and of itself.
Just like with the slaves, when a server is present, all nodes periodically send it backup packets, and the entire replication logic remains the same. However, unlike a slave node, the server retains all items from all the nodes, so it must know, at any given time, which node owns what items, so that it will only serve relevant items when a node fails (remember, the server, or a slave, serving items when the owner node is alive can destroy consistency, because replicated data can be somewhat older than live data, so serving both could be catastrophic).
Simply remembering which node sent the backup packet containing the latest item version is not enough because an item can change hands without any new versions being created. So whenever a node gains ownership of an item (when it receives the PUTX message), it informs the server of the transfer by sending it an INV message, but unlike INV messages sent to the item’s sharers, it waits for the server’s response before allowing writes for reasons that will become clear shortly.
If a node then dies and has no slave, the server and all peers mark all of its owned items as owned by the server. If a node cannot locate an item in any of the peers, i.e. no node responds with a PUT to its GET multicast, it will request it from the server.
Another tricky scenario we have to contend with is the following: assume node A owns item X, and it transfers it to node B. Node B then immediately sends an INV message to the server to inform it that X is now owned by B, but before the message is received by the server, node A dies and the server marks its items as owned by the server. Then node C can come along, request the server for ownership of item X, and the server will happily comply, resulting with both B and C believing they’re each X’s lawful owner. Galaxy handles this case by having the server reply with an INV of its own to B’s INV request, rather than with an INVACK, and this is why B must wait for the server’s reply before writing X, to ensure that it is truly the sole owner.
Because the server keeps track of every item’s current ownership, it can serve another role other than a backup mechanism: some cloud environments (most notably Amazon EC2 and Rackspace) do not allow multicasts, so Galaxy can be configured to ask the server for the identity of an item’s owner when it is initially requested, rather than multicasting the request to all peer nodes.
The server’s entire logic is implemented by the MainMemory class.
Special thanks to Henry Robinson of Cloudera for going over Galaxy’s fault-tolerance design, pointing out flaws and suggesting solutions.