How Semantics Can Take Graph Databases to New Levels

The recent increase in the popularity of property graph databases is well-founded, as they fulfill a real need.  But as usage continues to ramp up, the limitations of traditional property graph systems can become a roadblock.

Property graphs are a form of graph that several popular tools use to manage graph data.  The W3C’s Resource Description Framework (RDF), which forms the basis of the Semantic Web, is also a technology for managing graphs.  IT practitioners with only limited exposure to RDF often do not realize that graph management is at the very heart of RDF.  For this reason and others, many people assume that RDF cannot express property graphs, when in fact RDF can do so more powerfully than traditional property graph approaches can.

The Value of Graph Databases

Graph databases are especially good for managing data about the connections among resources.  They are excellent for encoding friend networks in social media, sensor networks in logistics, protein interaction pathways in life sciences, and more.  They help uncover relationships among people that may have implications for contexts such as insider trading, money laundering, and industrial espionage, and can reveal connections among financial institutions and financial instruments that have ramifications for systemic counterparty risk in masses of complex derivatives. 

This kind of connection data is exploding in volume and variety.  Thus, graph databases are here to stay as a vital mechanism for managing data.  In some enterprises, graph databases are even replacing relational databases as the primary database of record. 

What are Property Graphs?

Property graphs have the normal characteristics of mathematical directed graphs in that they consist of vertices (a.k.a. nodes) and directed edges.  Each edge connects two vertices, has a type, and can have one or more properties.  Vertices can have properties as well.  Each property is a key-value pair.  The ability to type an edge and attach properties to it increases the semantic expressiveness of directed graphs.

Figure 1: A Property Graph

For example, in the property graph that Figure 1 depicts, an edge of type “Knows” connects two vertices labeled “John R Peterson” and “Frank T Smith”.  The Knows edge has two properties, the key-value pairs <Provenance, “Mary L Jones”> and <Confidence Percent, 70>.  This property graph represents the following information:

  • John R Peterson knows Frank T Smith
  • The provenance of the assessment that John knows Frank is attributed to Mary L Jones; in other words, Mary is responsible for that assessment
  • The assessment that John knows Frank has a 70 percent confidence level (so it is not at all certain).

There are no official standards for property graphs, but a number of property graph tools use Blueprints, which is an API for manipulating and querying property graphs.  

Quick Overview of RDF

RDF is a system for representing directed graphs.  RDF has the notion of a logical triple, which consists of a subject, a predicate, and an object.  In a triple, the subject and object are vertices and the predicate is the edge that connects the subject and object.  The roles of subject and object imply the direction of the edge, which is from subject to object. 

An RDF graph consists of a set of triples.  A database that contains RDF graphs is called a triple store.
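As a rough illustration of this model, the following Python sketch treats a graph as a set of (subject, predicate, object) tuples. The `Triple` type and the plain-string names are assumptions made for brevity; a real RDF graph would identify resources with IRIs.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One directed edge: subject --predicate--> object."""
    subject: str
    predicate: str
    object: str


# A graph is a set of triples, so stating the same triple twice stores it once.
graph: Set[Triple] = {
    Triple("John R Peterson", "Knows", "Frank T Smith"),
    Triple("John R Peterson", "Knows", "Frank T Smith"),  # duplicate, collapses
}

# Direction matters: the reversed edge is a different triple and is not implied.
reversed_edge = Triple("Frank T Smith", "Knows", "John R Peterson")

for t in sorted(graph):
    print(f"{t.subject} --{t.predicate}--> {t.object}")
print("reverse edge present:", reversed_edge in graph)
```

Because the edge runs from subject to object, the sketch does not treat the reversed triple as part of the graph unless it is stated separately.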

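RDF is designed to work on the World Wide Web.  A triple’s predicate is always an IRI (Internationalized Resource Identifier), a generalization of the URI that allows characters from the Universal Character Set.  A triple’s subject is normally an IRI as well, although RDF also permits blank nodes (anonymous vertices).  A triple’s object can be an IRI, a blank node, or a literal such as a string or a number.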

The W3C has standardized several concrete syntaxes for representing RDF graphs, including an XML-based syntax (RDF/XML), a JSON-based syntax (JSON-LD), a terse syntax called Turtle that dispenses with XML markup, and a simple line-based syntax called N-Triples.  These syntaxes are all textual in nature.  There is no standard for representing RDF graphs graphically, although there are tools that do so.

The W3C has also standardized a query language called SPARQL for extracting information from RDF triple stores.  SPARQL bears some superficial resemblance to SQL, but its structure and syntax are all about triples.  For example, you can use SPARQL to search a triple store for triples that have Knows for the predicate.
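The idea of searching by predicate can be sketched without a triple store. The Python below is only an approximation of what a SPARQL triple pattern does: `match` is a hypothetical helper in which `None` plays the role of a SPARQL variable, and the "Works With" triple is invented filler data. In SPARQL itself the search would look roughly like `SELECT ?s ?o WHERE { ?s :Knows ?o }`, assuming a prefix has been declared for `:Knows`.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, object]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[object] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


store: List[Triple] = [
    ("John R Peterson", "Knows", "Frank T Smith"),
    ("Frank T Smith", "Works With", "Mary L Jones"),  # hypothetical filler triple
]

# Find every triple whose predicate is Knows, whatever its subject and object.
knows = match(store, predicate="Knows")
print(knows)
```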

Using RDF for Property Graphs

It is quite straightforward and intuitive to represent property graphs as RDF triples. 

Encoding Property Graphs in RDF – The Basics

An RDF representation of Figure 1’s property graph has the following three triples:

1. Subject: John R Peterson  Predicate: Knows  Object: Frank T Smith
2. Subject: Triple #1  Predicate: Confidence Percent  Object: 70 (Literal)
3. Subject: Triple #1  Predicate: Provenance  Object: Mary L Jones (Literal)

Note that we can assign an IRI to represent a triple, and that is why a triple can serve as the subject of another triple (and can serve as the object of a triple as well).  There are several techniques for assigning an IRI to a triple. 
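One such technique is the reification vocabulary that RDF itself defines (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`). The sketch below shows how it could be applied to the example; it is not necessarily the technique the figure assumes. The `ex:` names are hypothetical stand-ins for real IRIs, and the two helper functions are illustrative only.

```python
RDF_TYPE, RDF_STATEMENT = "rdf:type", "rdf:Statement"
RDF_SUBJECT, RDF_PREDICATE, RDF_OBJECT = "rdf:subject", "rdf:predicate", "rdf:object"

graph = {
    # The base triple (Triple #1 in the text).
    ("ex:JohnRPeterson", "ex:knows", "ex:FrankTSmith"),
    # Four triples that give the base triple an identifier, ex:triple1.
    ("ex:triple1", RDF_TYPE, RDF_STATEMENT),
    ("ex:triple1", RDF_SUBJECT, "ex:JohnRPeterson"),
    ("ex:triple1", RDF_PREDICATE, "ex:knows"),
    ("ex:triple1", RDF_OBJECT, "ex:FrankTSmith"),
    # The edge properties, stated about the identifier.
    ("ex:triple1", "ex:confidencePercent", 70),
    ("ex:triple1", "ex:provenance", "Mary L Jones"),
}


def described_triple(graph, statement):
    """Rebuild the (subject, predicate, object) that a reified statement names."""
    parts = {p: o for (s, p, o) in graph if s == statement}
    return (parts[RDF_SUBJECT], parts[RDF_PREDICATE], parts[RDF_OBJECT])


def edge_properties(graph, statement):
    """Collect the properties attached to a statement, skipping the rdf: bookkeeping."""
    reserved = {RDF_TYPE, RDF_SUBJECT, RDF_PREDICATE, RDF_OBJECT}
    return {p: o for (s, p, o) in graph if s == statement and p not in reserved}


print(described_triple(graph, "ex:triple1"))
print(edge_properties(graph, "ex:triple1"))
```

The cost of this approach is four extra bookkeeping triples per annotated edge, which is one reason other identification techniques exist.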

As mentioned earlier, vertices in property graphs can also have properties.  RDF can handle such properties. 

Overcoming Traditional Property Graph Limitations

We have established that RDF is perfectly capable of representing property graphs that are as expressive as the graphs that traditional property graph tools and systems support.  It is now time to explore how RDF enables property graphing to go beyond what the traditional tools and systems can do.

In traditional property graphs, the properties of edges and vertices must be literals such as strings and numbers; in other words, the properties can only have simple data types.  This limitation is significant.  In our example, suppose we wanted to assert more about Mary L Jones, who, according to the Provenance property, is responsible for the assertion that John knows Frank.  If we use RDF to represent the property graph, Mary L Jones can be more than a string literal that terminates the graph; it can be a first-class vertex that continues the property graph and has properties of its own.

With this capability, an expanded property graph can, for example, give us information we can use to assess the credibility of Mary’s assertion that John knows Frank.  A full-blown Mary L Jones vertex can have a property that assigns an average credibility percentage to Mary’s assessments.  The graph, instead of terminating, continues so that the following RDF triples represent the graph:

1. Subject: John R Peterson  Predicate: Knows  Object: Frank T Smith
2. Subject: Triple #1  Predicate: Confidence Percent  Object: 70 (Literal)
3. Subject: Triple #1  Predicate: Provenance  Object: Mary L Jones
4. Subject: Mary L Jones  Predicate: Credibility Percent  Object: 80 (Literal)

The credibility percentage 80 is a literal number that terminates the graph.  Alternatively, the graph could continue further: in place of this terminating literal we could have a vertex representing a document that contains an evaluation of Mary, including her average credibility percentage and case history.  We would then have the following triples:

1. Subject: John R Peterson  Predicate: Knows  Object: Frank T Smith
2. Subject: Triple #1  Predicate: Confidence Percent  Object: 70 (Literal)
3. Subject: Triple #1  Predicate: Provenance  Object: Mary L Jones
4. Subject: Mary L Jones  Predicate: Is Evaluated By  Object: MLJEvaluation
5. Subject: MLJEvaluation  Predicate: Credibility Percent  Object: 80 (Literal)
6. Subject: MLJEvaluation  Predicate: Case History  Object: MLJCaseHistory

RDF can represent these expanded property graphs naturally; traditional property graph systems cannot, because their property values must be literals, which cannot serve as vertices.
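To make the benefit concrete, here is a small Python sketch that walks the six-triple graph from the Knows edge to the credibility figure. The `follow` helper is hypothetical, and the plain strings stand in for the IRIs a real triple store would use.

```python
triples = [
    ("John R Peterson", "Knows", "Frank T Smith"),  # this is Triple #1
    ("Triple #1", "Confidence Percent", 70),
    ("Triple #1", "Provenance", "Mary L Jones"),
    ("Mary L Jones", "Is Evaluated By", "MLJEvaluation"),
    ("MLJEvaluation", "Credibility Percent", 80),
    ("MLJEvaluation", "Case History", "MLJCaseHistory"),
]


def follow(triples, start, path):
    """Walk from `start` along each predicate in `path`; return the values reached."""
    frontier = [start]
    for predicate in path:
        frontier = [o for (s, p, o) in triples if p == predicate and s in frontier]
    return frontier


# How credible is the source of the claim that John knows Frank?
credibility = follow(
    triples, "Triple #1", ["Provenance", "Is Evaluated By", "Credibility Percent"]
)
print(credibility)
```

The walk only works because Mary L Jones and MLJEvaluation are vertices; if Provenance held a string literal, the path would stop after the first step.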

Going Further with RDF

Stepping further beyond where traditional property graphs tread, consider the Knows edge’s Confidence Percent property in Figure 1.  The graph says nothing about the provenance of the 70 percent confidence value: did Mary assign it herself, or did someone else?

In order to assert that Mary assigned the value herself, our RDF-based property graph just requires an additional triple that establishes the provenance of the Confidence Percent rating:

  • Subject: Triple #2    Predicate: Provenance    Object: Mary L Jones 

This level of expressiveness is well beyond the capabilities of property graph systems, which do not allow a property assignment such as that asserted by Triple #2 to serve as a vertex.
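A toy store in which every triple automatically receives its own identifier shows why this nesting comes naturally. The `Store` class below is a hypothetical sketch, not a real triple store API, and its "Triple #n" identifiers merely mirror the numbering used in this article; real RDF would use IRIs.

```python
from typing import Dict, Tuple, Union

Value = Union[str, int]


class Store:
    """Toy statement store in which every triple receives its own identifier."""

    def __init__(self) -> None:
        self.by_id: Dict[str, Tuple[str, str, Value]] = {}

    def add(self, subject: str, predicate: str, obj: Value) -> str:
        # Number the triples in insertion order: Triple #1, Triple #2, ...
        triple_id = f"Triple #{len(self.by_id) + 1}"
        self.by_id[triple_id] = (subject, predicate, obj)
        return triple_id

    def about(self, subject: str) -> Dict[str, Value]:
        """Collect predicate -> object for every triple with the given subject."""
        return {p: o for (s, p, o) in self.by_id.values() if s == subject}


store = Store()
t1 = store.add("John R Peterson", "Knows", "Frank T Smith")
t2 = store.add(t1, "Confidence Percent", 70)
store.add(t1, "Provenance", "Mary L Jones")
# The confidence rating (Triple #2) is itself annotated with its provenance.
store.add(t2, "Provenance", "Mary L Jones")

print(store.about(t1))
print(store.about(t2))
```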

Leveraging Standards

The fact that RDF is a W3C standard increases its utility for managing property graphs. 

A number of RDF-based vocabularies, taxonomies, and ontologies are in the public domain.  They are available as file downloads, through SPARQL endpoints that allow the databases to be queried over HTTP, or both.  Together, these public domain RDF-based resources are referred to as the Linked Open Data Cloud, which RDF-based property graphs can leverage.

For example, the federal government’s Data-Gov system publishes an RDF dataset that contains summaries of research and development projects funded by the Department of Energy (DOE).  Each entry in the dataset is a numbered project vertex, such as Entry422, that represents a project and serves as the subject of several predicates such as project title, research organization, principal investigator, funding organization, and so forth.

If we use RDF to encode our example property graph, the graph can leverage the DOE dataset.  We could define a Met At property of the Knows edge, which points to a project in the DOE RDF dataset.  We can represent this in RDF simply by adding the following triple to the graph:

  • Subject: Triple #1     Predicate: Met At    Object: DOE Entry422 

Incorporating this information into the property graph could contribute to discovering relationships among people and companies for various purposes, such as recruiting talent, or investigating insider trading or industrial espionage. 

Semantic Graphs

The semantics of RDF are grounded in formal logic, as are all the W3C Semantic Web standards.  Reasoners based on these formalisms can evaluate RDF databases and detect certain kinds of inconsistencies and infer certain kinds of relationships that the database does not state explicitly. These capabilities explain why we refer to RDF-based databases as semantic graph databases.  Reasoners grounded in formal semantics can be potent tools when managing large graph databases.
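As a hedged illustration of inference, the Python sketch below applies a single rule from RDF Schema: if property p is declared a sub-property of q, then every triple (s, p, o) also implies (s, q, o). Production reasoners implement many more rules than this, and the "Is Acquainted With" property is a hypothetical addition used only to show the rule firing.

```python
SUBPROP = "rdfs:subPropertyOf"


def infer_subproperties(triples):
    """Apply (p subPropertyOf q) and (s p o) => (s q o) until nothing new appears."""
    inferred = set(triples)
    while True:
        sub = {(p, q) for (p, rel, q) in inferred if rel == SUBPROP}
        new = {(s, q, o) for (s, p, o) in inferred for (p2, q) in sub if p == p2}
        if new <= inferred:
            return inferred
        inferred |= new


stated = {
    ("John R Peterson", "Knows", "Frank T Smith"),
    ("Knows", SUBPROP, "Is Acquainted With"),  # hypothetical schema statement
}

closure = infer_subproperties(stated)
# The acquaintance triple was never stated explicitly; the rule derives it.
print(("John R Peterson", "Is Acquainted With", "Frank T Smith") in closure)
```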

The Bottom Line

Practitioners interested in leveraging the unique benefits of property graphs should consider the power of RDF, a technology based on recognized international standards, which is capable of taking property graphs to a whole new level. 
