Galaxy internals, part III
My previous two blog posts detailed Galaxy’s cache-coherence protocol and its fault-tolerance mechanisms. This final post in the Galaxy internals series describes the remaining piece: how Galaxy nodes communicate over the network.
While Galaxy’s networking, like all Galaxy components, is pluggable and could employ various networking protocols (e.g., Infiniband), the current release uses TCP and UDP for all networking, and it does so using the excellent Netty library.
Communicating with slaves and the central server
As mentioned in my last post, each Galaxy node can be configured to replicate its owned data items to slave nodes and/or to a central server for the sake of fault tolerance. These update packets may be large (they can contain a potentially large number of modified items), are usually sent asynchronously (the node does not need to wait for the replication to complete), and go to a small number of machines (just the node’s slave(s) and/or one server). Therefore, all replication communication is done over TCP. This is simple, and nothing more will be said about it here.
Peer-to-peer node communication
Communication among peer nodes has very different characteristics from those of data replication. Remember, messages sent among peer nodes are used to request ownership of items, sharing of items, or invalidation of items. They are:

- relatively infrequent, because a well-behaved Galaxy application keeps almost all work local to the nodes;
- short, because data items are small and a single message can contain at most one item;
- latency-sensitive, because most request messages block an operation until they complete;
- organized into request-reply pairs, so that acknowledging the receipt of each request and reply separately would be wasteful;
- and, finally, exchanged among a large number of nodes, each of which can send and receive messages to and from all the others.

These characteristics make UDP a better choice than TCP for peer-to-peer communication among nodes. Using TCP would have entailed either keeping an open connection from every node to every other node or paying the high cost of the TCP handshake frequently, and TCP would acknowledge each request and response separately. Also, some messages need to be broadcast to all nodes, which would require UDP multicast anyhow.
Using UDP, however, has its costs. UDP does not guarantee delivery, may deliver messages out of order, and does not allow messages of unlimited length – all guarantees Galaxy requires, so Galaxy’s own code must provide them. All of that is done in the UDPComm class.
While conceptually the simplest component of Galaxy, the UDP networking code has turned out to be the most complex, and will probably require some refactoring. It is also the component I have put the least thought into so far, so I believe it has the best potential for further performance improvements. Any suggestions for improving the process outlined below would be greatly appreciated.
Messages are grouped together into packets. Each node maintains a peer object for each of the other nodes; it keeps the last packet sent to the respective node and a queue of messages waiting to be sent to it.
Packets are configured to have a maximum size, which should optimally be slightly less than the maximum Ethernet frame size, leaving some room for the IP and UDP headers (this is why jumbo frames are recommended when using Galaxy). A packet can contain many messages, but a single message cannot be broken up across packets, which is why every Galaxy item must be smaller than the maximum packet size (besides, small items are good in general, as they are likely to be less contended). If the packet is full, the remaining messages wait in the queue.
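The per-peer bookkeeping can be sketched roughly as follows. This is an illustrative model only – class and field names here (`PeerSketch`, `pendingPacket`, `MAX_PACKET_SIZE`) are my own, not Galaxy’s actual API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch of a peer object: the packet currently in flight to
// that node, plus a queue of messages waiting for room in it.
final class PeerSketch {
    static final int MAX_PACKET_SIZE = 1400; // below Ethernet MTU, leaving room for IP/UDP headers

    final List<byte[]> pendingPacket = new ArrayList<>(); // messages in the current packet
    final Queue<byte[]> queue = new ArrayDeque<>();       // messages waiting for the next packet
    int packetSize = 0;

    void send(byte[] message) {
        // a single message can never span packets, so it must fit in one
        if (message.length > MAX_PACKET_SIZE)
            throw new IllegalArgumentException("message larger than max packet size");
        if (packetSize + message.length <= MAX_PACKET_SIZE) {
            pendingPacket.add(message);   // fits: goes into the current packet
            packetSize += message.length;
        } else {
            queue.add(message);           // packet full: wait in the queue
        }
    }
}
```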
Galaxy can be configured to wait a little while (usually a few microseconds) before transmitting a packet, so that more messages can be packed into it; the wait is kept short so latency does not grow.
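The bounded wait amounts to something like the following sketch (again, my own illustration, not Galaxy’s code): block for the first message, then keep draining the queue for at most a short deadline before transmitting whatever has accumulated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the bounded batching wait: after the first message
// arrives, poll for more messages until maxWaitNanos elapses, so several
// messages can share one packet without hurting latency much.
final class BatchingSender {
    static List<String> collectPacket(BlockingQueue<String> q, long maxWaitNanos)
            throws InterruptedException {
        List<String> packet = new ArrayList<>();
        packet.add(q.take());                       // block until the first message
        long deadline = System.nanoTime() + maxWaitNanos;
        String m;
        // keep collecting until the (short) deadline passes or the queue is drained
        while ((m = q.poll(deadline - System.nanoTime(), TimeUnit.NANOSECONDS)) != null)
            packet.add(m);
        return packet;                              // now transmit as one packet
    }
}
```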
Galaxy’s internal message-delivery API explicitly marks each message as either a request or a response (regardless of the message type). Requests and responses are handled a little differently when it comes to guaranteeing delivery.
If a packet contains any requests, it is sent over and over until responses to all of them are received; this ensures requests are delivered. Once a response is received, its matching request is removed from the packet and not sent again.
A response is re-sent whenever its request is received, so if a request arrives twice (perhaps because the response has not yet reached the requesting node), the same response is sent twice. This ensures responses are always delivered.
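The two rules can be sketched together as follows (names are illustrative, not Galaxy’s classes): the sender keeps retransmitting a packet while it still contains unanswered requests, and the receiver caches each response so it can replay it whenever the same request arrives again.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the delivery guarantees. Sender side: a packet is
// retransmitted while any of its requests is still unanswered. Receiver side:
// a response is computed once and replayed for every duplicate request.
final class DeliverySketch {
    final Set<Integer> unansweredRequests = new HashSet<>();   // sender's in-flight requests
    final Map<Integer, String> responseCache = new HashMap<>(); // receiver's produced responses

    void sendRequest(int id)  { unansweredRequests.add(id); }
    void onResponse(int id)   { unansweredRequests.remove(id); } // request dropped from the packet
    boolean mustRetransmit()  { return !unansweredRequests.isEmpty(); }

    String onRequest(int id) {
        // compute the response once, then re-send the same one on duplicates
        return responseCache.computeIfAbsent(id, k -> "response-" + k);
    }
}
```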
A simple mechanism is used to ensure that any message, be it a request or a response, is delivered to Galaxy’s logic only once.
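The post does not detail that mechanism, so the following is an assumption on my part, not necessarily Galaxy’s scheme: a common way to get exactly-once delivery on top of retransmission is a per-sender sequence number, delivering a message only when it is the next one expected and discarding retransmitted copies.

```java
// Hypothetical sketch of a duplicate filter (the actual mechanism is not
// described in the post): each message carries a per-sender sequence number,
// and only the next expected one is handed to Galaxy's logic.
final class DedupSketch {
    long nextExpected = 0; // per-peer state on the receiving side

    // Returns true if the message should be delivered to Galaxy's logic.
    boolean deliver(long seq) {
        if (seq != nextExpected)
            return false;   // a retransmitted (or premature) copy: ignore
        nextExpected++;
        return true;
    }
}
```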
All messages sent from node A to node B must be delivered to B in the same order they’ve been sent by A in order to ensure correct ordering of memory operations, which is required for consistency.
All messages in a packet are processed by the receiving node in the order they appear in the packet. But what if a sender wants to send a message after a packet has been sent but possibly before it has been received? If we created a new packet, ordering might be violated, because UDP does not guarantee that packets are delivered in the order they are sent. If, on the other hand, we waited until all messages in a packet were received before sending a new one, we would unnecessarily increase latency.
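One resolution consistent with the structures described above – offered here as an inference, not a statement of what Galaxy actually does – is to append new messages to the still-unacknowledged packet and re-send the whole thing: the receiver processes messages in packet order and skips the already-delivered prefix, so ordering is preserved without waiting for the previous messages to be acknowledged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (an inference, not necessarily Galaxy's approach):
// new messages are appended to the in-flight packet, which is re-sent whole;
// the receiver delivers messages in packet order and drops the prefix it
// has already seen.
final class OrderingSketch {
    final List<String> inFlightPacket = new ArrayList<>(); // re-sent until fully acked
    int receiverDelivered = 0;                             // receiver-side cursor

    void send(String msg) { inFlightPacket.add(msg); }     // grows the pending packet

    // Receiving a (re)transmission: deliver only the not-yet-seen suffix, in order.
    List<String> receive() {
        List<String> fresh = new ArrayList<>(
                inFlightPacket.subList(receiverDelivered, inFlightPacket.size()));
        receiverDelivered = inFlightPacket.size();
        return fresh;
    }
}
```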
Example: putting it all together
The following example will hopefully demonstrate how the delivery and ordering guarantees work together (requests are marked Q and responses S; Q-A2 means request #2 originating at node A, while S-B3 means the response to request Q-B3):
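The original example does not survive in this copy, so here is an illustrative trace of my own in the notation just introduced (a reconstruction of the idea, not the original figure):

```
A -> B : packet { Q-A1, Q-A2 }     packet with two requests; re-sent until answered
           (packet lost)
A -> B : packet { Q-A1, Q-A2 }     retransmission; B delivers Q-A1 then Q-A2, in order
B -> A : packet { S-A1, S-A2 }     responses to both requests
           (response packet lost)
A -> B : packet { Q-A1, Q-A2 }     still unanswered, so A re-sends
B -> A : packet { S-A1, S-A2 }     B filters the duplicate requests from its logic,
                                   but re-sends the responses
A receives S-A1 and S-A2, removes Q-A1 and Q-A2 from the packet,
and the next packet can take messages waiting in the queue.
```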
Some requests (specifically, GET or GETX the first time an item is requested) are multicast to the entire cluster. These requests must be ordered with respect to all other messages sent to all the nodes, and this is achieved as follows. A special message, which cannot be added to a regular unicast packet, is added to every peer’s message queue. Once all the messages in a peer’s pending packet have been acknowledged (all responses have been received), the special message is taken from the queue, and the peer object responsible for communicating with that specific node is put into a special waiting state. Once all peer objects are waiting, the multicast packet is sent, and then re-sent, until all peer objects receive a response. Once that happens, they are all released and can continue retrieving unicast messages from their message queues.
In order not to hold up all communication if one or a few nodes fail to respond to the multicast (due to some failure, full buffers, etc.), once the number of peer objects still waiting for a response falls below a configurable threshold, those peer objects are told to keep re-sending the message – but as a plain unicast message – and all peer objects are immediately released to resume regular unicast operation.
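The multicast barrier described in the last two paragraphs can be sketched as a small state machine (illustrative names and structure, not Galaxy’s code): each peer enters a waiting state once its pending packet is fully acknowledged; the multicast goes out when all peers are waiting; and if only a few stragglers remain afterwards, they fall back to unicast so everyone else can resume.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the multicast barrier: peers move to WAITING as
// their pending packets are acknowledged; the multicast is sent once all
// are waiting; stragglers below a threshold fall back to plain unicast.
final class MulticastSketch {
    enum State { UNICAST, WAITING }

    final List<State> peers = new ArrayList<>();
    final int unicastFallbackThreshold; // fewer stragglers than this -> unicast fallback

    MulticastSketch(int nPeers, int threshold) {
        for (int i = 0; i < nPeers; i++) peers.add(State.UNICAST);
        this.unicastFallbackThreshold = threshold;
    }

    // Peer i's pending packet has been fully acknowledged.
    void onPacketAcked(int i) { peers.set(i, State.WAITING); }

    // The multicast may be sent only when every peer is waiting.
    boolean readyToMulticast() {
        return peers.stream().allMatch(s -> s == State.WAITING);
    }

    // A response to the multicast arrived from peer i: release it.
    void onMulticastResponse(int i) { peers.set(i, State.UNICAST); }

    // If only a few stragglers remain, release everyone and re-send to the
    // stragglers as simple unicast messages instead.
    boolean fallBackToUnicast() {
        long stillWaiting = peers.stream().filter(s -> s == State.WAITING).count();
        return stillWaiting > 0 && stillWaiting < unicastFallbackThreshold;
    }
}
```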
This concludes the 3-part series on Galaxy’s internals. Future blog posts will discuss distributed data-structures implemented on top of Galaxy and their efficiency.