MongoDB Matters: MongoDB Strives for a Perfect Consensus

MongoDB gained popularity with developers very early on, but serious database engineers were often skeptical about MongoDB's architecture and implementation. One area that drew particular criticism was cluster consistency.

Any distributed system must implement a consensus mechanism that allows members of a cluster to agree on the state of the cluster. In particular, distributed databases need to decide on what to do when two parts of the distributed system get separated by a network partition.

The CAP theorem (also known as Brewer's theorem) tells us that distributed databases need to choose between availability and consistency when a network partition occurs. Many "NoSQL" systems, Cassandra and DynamoDB for instance, describe themselves as "eventually consistent." An eventually consistent system will choose availability over consistency when a network fault occurs. However, MongoDB is not an eventually consistent database: When a network partition occurs, only one side of the partition will continue to function, namely the side that has a majority of the nodes in the cluster. If neither side holds a majority, neither side can accept writes.
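The majority rule can be illustrated with a few lines of Python. This is a minimal sketch of the arithmetic only, not MongoDB code; the function names has_majority and surviving_side are hypothetical and chosen for this example.

```python
from typing import List, Optional


def has_majority(reachable_members: int, total_members: int) -> bool:
    """Return True if this side can see a strict majority of the cluster."""
    return reachable_members > total_members // 2


def surviving_side(side_sizes: List[int]) -> Optional[int]:
    """Return the index of the partition side that may keep operating.

    side_sizes holds the number of nodes on each side of the partition.
    Returns None when no side holds a strict majority.
    """
    total = sum(side_sizes)
    for index, size in enumerate(side_sizes):
        if has_majority(size, total):
            return index
    return None
```

With a five-node cluster split three and two, surviving_side([3, 2]) selects the three-node side. With a four-node cluster split evenly, surviving_side([2, 2]) returns None, which is one reason clusters are usually given an odd number of voting members.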

According to MongoDB oral history, the original cluster protocol—pv0—was written by a single engineer with relatively limited experience. I once heard a MongoDB engineer describe pv0 as “awesome” and “almost correct.”  Unfortunately, in distributed systems work, “almost correct” is less than adequate.

Although pv0 worked in simplistic situations, it had some fundamental problems. There were known cases in which apparently committed data could be lost during cluster failovers. 

Even in non-failure modes, it was still possible to see inconsistent results in these early versions of MongoDB.  This could occur when a network error made MongoDB unsure if data had made it to a remote node. In other scenarios, MongoDB would allow reads of uncommitted data.

These problems did not overly concern most MongoDB users. Because MongoDB was a non-transactional database, its users were generally used to the idea that some degree of inconsistency was to be expected. However, as MongoDB geared up for the transactional capabilities of version 4.0, and as it gained ground in enterprise accounts, it became increasingly important to address these issues.

Successive versions of MongoDB showed improvements in the consensus algorithm. With version 3.6, MongoDB was able to pass the Jepsen tests, widely regarded as the premier test suite for distributed system consistency. Causal consistency in MongoDB 3.6 allows database sessions to read their own writes and to see operations in a logically consistent order. Sessions submit timestamps with each database request, and the server waits until it has reached that timestamp before responding.
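The timestamp mechanism can be sketched as a toy simulation. The code below is an illustration under simplifying assumptions, not MongoDB's implementation: the Primary, Secondary, and Session classes are hypothetical, the "timestamp" is a simple counter, and "waiting" is modeled by letting the secondary apply pending log entries until it catches up.

```python
class Primary:
    """Accepts writes and records them in an ordered log."""

    def __init__(self):
        self.clock = 0
        self.oplog = []  # entries are (timestamp, key, value)

    def write(self, key, value):
        self.clock += 1
        self.oplog.append((self.clock, key, value))
        return self.clock  # timestamp handed back to the session


class Secondary:
    """Applies the primary's log lazily, so it may lag behind."""

    def __init__(self, primary):
        self.primary = primary
        self.applied_time = 0
        self.data = {}

    def replicate_one(self):
        """Apply the next log entry; return False if there is none."""
        if self.applied_time >= len(self.primary.oplog):
            return False
        timestamp, key, value = self.primary.oplog[self.applied_time]
        self.data[key] = value
        self.applied_time = timestamp
        return True

    def read(self, key, after_time):
        # Stand-in for "the server waits": catch up to the requested
        # timestamp before answering.
        while self.applied_time < after_time:
            if not self.replicate_one():
                raise TimeoutError("timestamp not reached")
        return self.data.get(key)


class Session:
    """Remembers the latest timestamp this client has observed."""

    def __init__(self):
        self.operation_time = 0

    def write(self, primary, key, value):
        self.operation_time = max(self.operation_time, primary.write(key, value))

    def read(self, secondary, key):
        return secondary.read(key, self.operation_time)
```

A session that writes "x" to the primary and then reads "x" through a lagging secondary still sees its own write, because the secondary refuses to answer until it has applied everything up to the session's timestamp. A reader without a session timestamp would have no such guarantee.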

Many, if not most, consistency problems in distributed systems occur when nodes fail. In MongoDB, there is always a single primary in a replica set, and only primaries can service write requests. When all the nodes in a cluster are available, the logistics for ensuring consistency and consensus are fairly straightforward. But when nodes fail, it is necessary to determine which node will take over as primary and how transactions that are in the process of replicating will be treated.

In MongoDB, when a primary fails, the remaining nodes hold an election to determine the new primary. In older versions of MongoDB, certain edge conditions could result in the election of a primary that did not have the latest information, causing some loss of committed data during a failover. But by version 3.6, MongoDB had implemented a protocol based on Raft, which resolved these issues.
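The way a Raft-style election avoids promoting a stale node can be sketched as follows. This is a simplified illustration of the general Raft voting rule, not MongoDB's actual election code; the Node type and the grants_vote and can_win functions are hypothetical names for this example.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    name: str
    last_term: int   # election term of the node's newest log entry
    last_index: int  # position of the node's newest log entry


def grants_vote(voter: Node, candidate: Node) -> bool:
    """A voter only supports a candidate whose log is at least as current."""
    return (candidate.last_term, candidate.last_index) >= (
        voter.last_term,
        voter.last_index,
    )


def can_win(candidate: Node, reachable: List[Node], cluster_size: int) -> bool:
    """Return True if the candidate gathers votes from a strict majority."""
    votes = 1  # the candidate votes for itself
    for voter in reachable:
        if voter.name != candidate.name and grants_vote(voter, candidate):
            votes += 1
    return votes > cluster_size // 2
```

Consider a three-node cluster whose primary has failed, leaving node a with log position (term 2, index 10) and node c with (term 2, index 8). Node c cannot win, because a refuses to vote for a candidate with an older log, while a can win with c's support. Since committed data already lives on a majority of nodes, any candidate able to collect a majority of votes must hold that data.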

The MongoDB consensus story is typical of MongoDB's technology arc. MongoDB often implemented simplistic solutions that evolved to production quality only over time. Early versions of MongoDB were often truly MVPs (Minimum Viable Products). While this left MongoDB open to a lot of criticism in its early versions, it allowed the company to iterate rapidly and resulted in high agility and momentum. MongoDB's success in the market is a testament to the success of this approach.