On Distributed Memory Systems
Or: Introducing Galaxy, a novel in-memory data grid by Parallel Universe
A distributed memory system is one that serves data items from multiple nodes that communicate by passing messages to each other, while providing the illusion that the items are served from a monolithic memory. There are three big problems facing a distributed memory system. The first, data placement or partitioning, is the problem of choosing on which node a data item should reside, as the entire data cannot fit in a single node. The second, consistency, concerns the possibility of reading different values of the same data item, as the distributed system may store it on different nodes. The third, fault tolerance, is the problem of handling node or network failures.
Designing a distributed memory system requires making tradeoffs with regard to those three problems, tradeoffs that affect performance, ease-of-use, and system availability.
While there has been much discussion about tradeoffs concerning consistency and fault-tolerance, in this blog-post I’d like to address the problem of data placement. The choice of how to partition data items across nodes determines the item access latency, as well as the load on each node.
To the best of my knowledge, all widespread distributed memory products, be it NoSQL databases like Dynamo, Riak, Cassandra and others, or in-memory data grids (more about those later) like Gigaspaces, Infinispan and Terracotta, use some sort of a distributed hash table (DHT), where one or several fields of each item or of the key used to access it are used to compute a hash value, which, in turn is mapped to a node. In order to locate an item, all you need to know is its hash-value.
For load balancing, these products often use consistent-hashing which helps distribute items evenly among cluster nodes as nodes are added or removed from the system. For completeness’ sake I’ll mention that some of these products (Riak, Cassandra) allow for eventual-consistency, while others (all IMDGs) acquire distributed locks to coordinate multi-item transactions.
Such DHT systems have the advantage that items can be located very easily. If you have the key, all you need to do is compute the hash and have some information about the nodes, like their position on the consistent-hash circle. Accessing an item, for either a read or a write operation, would then require a single network hop. This is the best we can do if we need to access items completely at random - which is indeed the case for many applications.
The downside to this approach is that while the single network hop is the worst-case scenario, it is also the common case. If we perform any computation on the memory nodes themselves (as is often done with data grids), if a computation requires more than one data item, chances are that they will each reside on a different node. Some systems solve this problem by using some sort of locality-preserving hashes that ensure that items are grouped in nodes in a way that keeps “similar” items together (whatever similar means for the application) so that range queries could be carried out efficiently without hopping all over the cluster to retrieve the data. Still, such solutions are rigid in the sense that items are not expected to change their hash value - and there’s usually no reason why you’d want them to in a key-value store.
How CPUs do it
Main memory (RAM) access is the slowest operation modern CPUs perform that doesn’t involve actual I/O. This is why modern CPUs employ on-core cache memory (L1 cache), which is a small memory unit attached to each of the CPU’s cores, with access latency two orders-of-magnitude smaller than main memory access. And because communication among cores is still an order-of-magnitude faster than accessing RAM, the on-core caches form a distributed memory system.
However, L1 caches do not use the distributed hash-table approach, because accessing a different core’s L1 is much slower than accessing the L1 of the cache your code is running on, and as we saw, a DHT access requires one hop in the common case. Instead, L1 caches migrate data items (called cache lines) from one cache to another using a cache-coherence protocol, which is also used to enforce consistency.
The way this works is that a cache line is “owned” by a single L1 cache at a time. If a thread running on one of the cores wishes to read a memory address, a copy of the appropriate cache line is fetched from the cache that owns it (or from the main memory if none do, but that detail is of no concern to us at this point). When a thread wishes to write to a memory address, its core’s L1 cache requests the owner cache for a transfer of ownership, and then requests all other caches to purge (invalidate) their read-only copies in order to ensure consistency (I highly recommend reading this excellent paper by IBM’s Paul McKenney that explains this process in detail). After that, the cache line remains owned by the thread’s L1 cache, until it is invalidated by another thread, running on a different core, that wishes to write to that cache line.
That means that when the thread wishes to write to the same cache line again, it would require no further bus communication, no further “network hops”. However, this it comes at the cost of an expensive initial lookup: because we cannot guess the owning node simply by the memory address (the “key”), we have to search for it on all cores, usually by broadcasting a request that requires the attention of all caches.
The reason this tradeoff pays for CPU cores is that the L1 caches are very small (a few megabytes), and most code displays a behavior known as temporal-locality: it tends to access the same memory addresses over and over in some short period of time. If the L1 caches were bigger than the span of a program’s temporal-locality (the amount of memory it accesses repeatedly in a short time-span) then it would contain many cache lines that a thread would only access once before another thread, on another core, requests them. Those lines would migrate randomly from one cache to another, and we’d lose most of the performance gains afforded by the cache.
This is why most software distributed memory systems, those that distribute memory across a cluster of machines, prefer using a DHT approach rather than cache-coherence. Each node contains a large amount of data, far exceeding applications’ temporal locality, so most applications’ data access patterns appear to the system as effectively random. In that case, it’s better to pay one network hop in the common case, and avoid the expensive lookup cache coherence requires in the worst-case.
When we were building SpaceBase, our real-time spatial data store for MMO games, location-based services and military C4I systems, we were looking for a solution that would let us distribute the data store over a cluster to support more objects and more processing, so we took a look at in-memory data grids. An in-memory data grid (IMDG) is a distributed memory systems that stores all data in cluster nodes’ RAM (as opposed to using the disk), and one that serves as the main data store (as opposed to a cache for some database). So IMDGs have relatively low latencies, and, unlike some NoSQL databases, they usually guarantee strong consistency, which our target applications demand.
Storing and querying spatial data is usually harder for a database than managing one-dimensional data. Spatial data structures are more complicated than their 1D counterparts, and are often less efficient. But spatial data does have one quality working in its favor - the access patterns aren’t random. Range queries are much more common for spatial data (e.g. query all objects contained in a region) than for one-dimensional data, and when accessing an object in a query or transaction, there is a very high probability that spatially near objects will be accessed as well, and a low probability that we’d access faraway objects in the same transaction.
Distributing a spatial database over a DHT would nullify this one advantage. Even if we were to hash the object’s location and use a locality-preserving hash, it would not have helped us, as the objects move around and this would require constantly changing the hash values; using a constant id for a key would not have made sense either, because then each transaction access random keys and would require many network hops.
And that’s why we’ve begun to build Galaxy, a new kind of IMDG that does not use hashing (or keys at all) for partitioning, but rather a software implementation of a cache-coherence protocol, very similar to the one used by CPU L1 caches. As objects move around in space, our database would use Galaxy to migrate them from one node to another, and because nearby objects are often accessed at the same time, keeping them on the same node would do away with network I/O for the common case, and would offset the higher cost of an initial object lookup.
An IMDG like Galaxy would not fit all applications. Those that display random data access - like commerce and finance applications I suppose, though I am not too familiar with them - are better off choosing one of the very good DHT products. But applications that require low latency processing of dynamic data that is best stored in bunches, and that requires strong consistency guarantees, would benefit greatly from a “cache-coherent” IMDG. Other than applications dealing with moving spatial objects (like MMO games), I think graph databases will benefit from distribute their data with Galaxy, as well. Graph databases’ most common operation is graph traversal, which has the same property as spatial databases - nearby objects are often accessed together, while objects far apart rarely are.
Now, at Parallel Universe we’re used to building military grade software. But, as we announced yesterday on TechCrunch, we’re now part of the summer ’12 batch of Y Combinator, so we’ve decided to try and adopt the start-up culture of “release early, release often”, and release Galaxy now, when it is not yet production ready. And because we think it has many uses that we haven’t thought of, and because we think there is a lot of interesting research to be done with a cache-coherent IMDG, we’re releasing it as open-source software, under the LGPL license. You can find it on Github right now.
How Galaxy works
Galaxy is a distributed RAM. It is not a key-value store. Rather, it is meant to be used as a infrastructure for building distributed data-structures. In fact, there is no way to query objects stored on Galaxy at all. Instead, Galaxy generates an ID for each item, that you can store in other items just like you’d store a normal reference in a plain object graph.
The application runs on all Galaxy nodes alongside with the portion of the data that is kept (in RAM) at each of the nodes, and when it wishes to read or write a data item, it requests the Galaxy API to fetch it.
At any given time an item is owned by exactly one node, but can be shared by many. Sharers store the item locally, but they can only read it. However, they remember who the owner is, and the owner maintains a list of all sharers. If a sharer (or any node) wants to update the item (a “write”) it requests the current owner for a transfer of ownership, and then receives the item and the list of sharers. Before modifying the item, it invalidates all sharers to ensure consistency. Even when the sharers are invalidated, they remember who the new owner is, so if they’d like to share or own the item again, they can request it from the new owner. If the application requests an item the local node has never seen (or it’s been migrated again after it had been validated), the node multicasts the entire cluster in search of it.
The idea is that when data access is predictable, expensive operations like item migration and a clueless lookup are rare, and more than offset by the common zero-I/O case. In addition, Galaxy uses some nifty hacks to eschew many of the I/O delays even in worst-case scenarios.
In the coming weeks I will post here the exact details of Galaxy’s inner-workings. What messages are transferred, how Galaxy deals with failures, and what tricks it employs to reduce latencies. In the meantime, I encourage you to read Galaxy’s documentation and take it for a spin.
A parting note
Galaxy is still experimental. It is absolutely not ready for production, but it is ready for research. It has bugs. Probably lots of them, and probably some conceptual bugs as well. But that’s mostly because, we think, it’s something quite new.