NoSQL and Document-Oriented Databases

Because any database that does not support the SQL language is, by definition, a "NoSQL" database, some very different databases coexist under the NoSQL banner. Massively scalable data stores like Cassandra, Voldemort, and HBase sacrifice structure to achieve scale-out performance. However, the document-oriented NoSQL databases have very different architectures and objectives.

In the mid-1980s, Lotus founder Mitch Kapor and Ray Ozzie (until recently Microsoft's chief architect) worked together to create a collaboration and personal productivity tool called Lotus Notes. Lotus Notes is most often thought of as an alternative to the email and scheduling capabilities provided by Microsoft Outlook, and very rarely as a database platform. However, Lotus Notes included a back-end database that was optimized for storing and working with complex documents.

Lotus Notes ended up inspiring the approach taken by two of today's best-known NoSQL systems: CouchDB and MongoDB. Like Notes, these database systems store information not as normalized relational tables, but as documents in a rich self-describing structure. Both use a variant of JavaScript Object Notation (JSON) to store these documents. JSON is somewhat like XML, but offers more compact storage and lower processing overhead.

Document databases primarily appeal to developers for the very reason that relational databases don't. The entity-relationship data model of the relational database management system (RDBMS) is inherently different from the object-oriented model of modern programming languages. The effort needed to translate objects back and forth from the RDBMS is a drag on programmer productivity. Object-relational mapping (ORM) systems like Hibernate exist to automate this mapping, but they only partially relieve the pain.

In a document database, the document can map almost directly to the programming language's class structure. This makes programming easier, but does raise issues of data integrity, since some data items are almost inevitably duplicated. For instance, in an RDBMS, product names are stored in a table that is separate from order details. But, in a document database, it would not be unusual to embed product names directly in the order document, which would create problems if we ever renamed a product.
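
The renaming problem can be seen in a small sketch. The catalog, the order documents, and every field and function name here are hypothetical, chosen only for illustration; plain Python dictionaries stand in for stored documents, and no particular database API is implied.

```python
# Hypothetical product catalog, keyed by product id.
products = {"p1": {"name": "Widget"}}

# Hypothetical order documents. Each one embeds its own copy of the
# product name instead of referring to the catalog (denormalized).
orders = [
    {"order_id": 1, "items": [{"product_id": "p1", "product_name": "Widget", "qty": 2}]},
    {"order_id": 2, "items": [{"product_id": "p1", "product_name": "Widget", "qty": 1}]},
]


def rename_product(product_id, new_name):
    """Rename in the catalog only; embedded copies are left untouched."""
    products[product_id]["name"] = new_name


def stale_items(orders, products):
    """Return (order_id, embedded_name) pairs that disagree with the catalog."""
    return [
        (order["order_id"], item["product_name"])
        for order in orders
        for item in order["items"]
        if item["product_name"] != products[item["product_id"]]["name"]
    ]


def propagate_rename(orders, product_id, new_name):
    """Fix-up pass: rewrite every embedded copy; return how many changed."""
    changed = 0
    for order in orders:
        for item in order["items"]:
            if item["product_id"] == product_id and item["product_name"] != new_name:
                item["product_name"] = new_name
                changed += 1
    return changed
```

After rename_product("p1", "Gadget"), stale_items reports both orders until propagate_rename has visited every document. In a normalized relational schema, the same rename would touch a single row.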

Document databases also promise a more flexible approach to schema changes. In an RDBMS, any change to the data model is costly: programs need to be modified, then deployed in conjunction with the schema change. On large databases, the schema change itself might involve propagation through hundreds of sharded database nodes. In a document database, an application can modify the document structure whenever it wants. That's a mixed blessing, of course, because successive application versions can leave behind documents with inconsistent or obsolete structures.
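
One common way to cope with mixed document structures is to upgrade old documents as they are read. The following is a minimal sketch under assumed conventions: the schema_version field, the two customer layouts, and the function name are all hypothetical and are not features of CouchDB or MongoDB.

```python
def upgrade_customer(doc):
    """Return the document in the current (version 2) shape.

    Assumed layouts, for illustration only:
      version 1: {"name": "Ada Lovelace", ...}
      version 2: {"schema_version": 2, "first_name": ..., "last_name": ..., ...}
    Documents without a schema_version field are treated as version 1.
    """
    version = doc.get("schema_version", 1)
    if version == 1:
        # Split the single name string at the first space.
        first, _, last = doc["name"].partition(" ")
        upgraded = {k: v for k, v in doc.items() if k != "name"}
        upgraded.update(schema_version=2, first_name=first, last_name=last)
        return upgraded
    # Already current: return a copy so callers cannot mutate the original.
    return dict(doc)
```

The cost of this flexibility is that every reader must carry code like this for as long as old documents remain in the database.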

The document model has some features that encourage scalability. Because all the data needed for most operations is held in a single document, there is no need for joins or multi-object transactions; in fact, these are not directly supported. Omitting joins and transactions eases clustering issues, and both CouchDB and MongoDB support scalable clustering, either through sharding (MongoDB) or consistent hashing (CouchDB).
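
To show why self-contained documents distribute easily, here is a generic sketch of consistent hashing. It illustrates the general technique only and is not CouchDB's actual implementation; the HashRing class, the node names, and the choice of SHA-256 with 100 virtual points per node are assumptions made for this example.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Generic consistent-hash ring (illustrative, not a real product's code)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed at `replicas` pseudo-random points on the ring
        # so that documents spread fairly evenly across nodes.
        ring = []
        for node in nodes:
            for i in range(replicas):
                ring.append((_hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    def node_for(self, doc_id):
        """Return the node owning the first ring point after the document's hash."""
        idx = bisect.bisect_right(self._hashes, _hash(doc_id)) % len(self._ring)
        return self._ring[idx][1]
```

Because each document is located by its identifier alone, a request can be routed to one node without consulting the others. When a node is added, only the documents that now fall just before one of the new node's points change owner; everything else stays where it was.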

CouchDB and MongoDB also provide relatively rich query tools: MapReduce functions, views, and secondary indexes. Query limitations, therefore, are generally less restrictive in document databases than in other NoSQL databases.
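
The idea behind a MapReduce view can be sketched in a few lines. This is an in-memory illustration in Python, not the CouchDB or MongoDB API (both of which express map and reduce functions in JavaScript); the order documents, field names, and function names are hypothetical.

```python
from collections import defaultdict


def map_order(doc):
    """Map step: emit one (customer, order total) pair per order document."""
    total = sum(item["qty"] * item["price"] for item in doc["items"])
    yield doc["customer"], total


def reduce_totals(values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def run_view(docs, map_fn, reduce_fn):
    """Apply map_fn to every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}
```

Given two orders for "alice" and one for "bob", run_view(docs, map_order, reduce_totals) returns one total per customer. Since the map step looks at one document at a time, the work can be spread across the nodes that hold the documents.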

All of this sounds quite brilliant, but document databases are not without their drawbacks. The popularity of the document model is driven more by programmatic elegance than by scalability and performance. It's not clear that these databases can outperform SQL databases at large scale. Early adopters have also experienced the availability and data integrity issues common to technologies that are still maturing.

Nevertheless, the document databases offer a lot of attractive features and are an important part of the NoSQL landscape.