Graph Databases and the Value They Provide

The relational database is primarily oriented toward the modeling of objects (entities) and relationships.  Generally, the relational model works best when there are a relatively small and static number of relationships between objects.  It has long been a tricky problem in the RDBMS to work with dynamic, recursive or complex relationships.  For instance, it's a fairly ordinary business requirement to print out all the parts that make up a product - including parts which, themselves, are  made up of smaller parts.  However, this "explosion of parts" is not consistently supported by all the relational databases.  Oracle, SQL Server and DB2 have special, but inconsistent, syntax for these hierarchical queries, while MySQL and PostgreSQL lack specific support.

To make the situation worse, the World Wide Web and many Web 2.0 sites exhibit far more complicated networks of relationships than were expected when SQL was designed.  The network of hyperlinks that connect all the pages on the World Wide Web is extraordinarily complex and almost impossible to model efficiently in an RDBMS.  Similar issues are involved in modelling the social networks of Twitter, Facebook and other comparable sites.

Graph databases aspire to overcome these problems by adopting a data model in which the relationships between objects or entities are equally as important as the objects themselves.  In a graph database, data is represented by nodes, edges and properties.  Nodes are the familiar objects that we might model in a RDBMS or key-value store - customers, products, parts, web pages, etc.  Edges represent some relationship between two nodes - friendships, links, composition, etc.   Both nodes and edges can have properties, so the nature of a relationship can be given quite specific characteristics.  Traversing - walking - the network of relationships is a fundamental and efficient operation in a graph database.

The two best known graph databases are probably FlockDB and Neo4J.

Neo4J is a Java-based graph database that easily can be embedded in any Java application, or run as a standalone server.  Neo4j has been around in various forms for many years, with the first official production release in early 2010.    Neo4j supports billions of nodes, ACID compliant transactions and multi-version consistency. 

Multi-server scalability is still a work in progress for the Neo4j team.  Most NoSQL databases horizontally scale across multiple servers by partitioning or "sharding" data, so that all the data of interest usually can be found on a particular host.  For instance, all the information pertaining to a particular user might be sharded to a specific server.  The nature of graph databases makes this fairly difficult, since, by definition, the edges define relationships between multiple nodes so they can't easily be isolated to a specific server.

Last year, Twitter open sourced their graph database FlockDB.  FlockDB overcomes the difficultly of horizontally scaling the graph by limiting the complexity of graph traversal.  In particular, FlockDB does not allow multi-hop graph walks, so it can't do a full "explosion of parts."  However, it is very fast and scalable if you only want to access first level relationships.

There are other graph-related projects of note, including attempts to add graph-like capabilities to the Hadoop ecosystem through the Hama project - the Gremlin graph programming language - and other commercial and open source graph databases such as InfiniteGraph and GraphDB.

Like most NoSQL database systems, graph databases are not a general replacement for the relational databases that have been the mainstay of business data storage for the last two decades. Rather, they are highly specialized systems tailored for very specific use cases.  Those use cases are increasingly moving from niche to mainstream, however, so I expect to see increasing interest in graph databases.