
Semantic Graphs and the New Data Integration Landscape

Conventional data management systems are fundamentally ill-suited for the world of data as it exists today. These systems, based with few exceptions on the relational data model, are broken because they integrate based on data location at the storage layer. While this approach worked reasonably well for the past 25 years, the world today has far too much data to use data location in storage as the basic lever.

The ill-suitedness of traditional, relational data model-based data integration tools reveals itself in several ways. The most obvious difficulties occur when combining several data silos or sources because, in nearly all cases, they were modeled differently and conform to their own independent sets of rules and constraints. Data integration breaks down for two reasons. First, a single shared data model has to represent a global view over all of the sources. Second, significant manipulation and transformation are typically required to map between the source and target schemas and to make the source data conform to a set of standardized rules.
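As a rough illustration of that second point, the sketch below maps two differently modeled sources onto one shared target schema. The source names, field names, and the country-code rule are all hypothetical, chosen only to show how much hand-written transformation even a single entity can require.

```python
# Hypothetical records from two silos that model the same customer differently.
crm_rows = [{"cust_id": 17, "full_name": "Ada Lovelace", "country": "UK"}]
billing_rows = [{"customer": "17", "name": "LOVELACE, ADA", "country_code": "GB"}]

# Assumed standardization rule: the target schema stores one code per country.
COUNTRY_TO_CODE = {"UK": "GB", "GB": "GB"}


def from_crm(row):
    """Map a CRM row onto the shared target schema."""
    return {
        "customer_id": int(row["cust_id"]),
        "name": row["full_name"],
        "country": COUNTRY_TO_CODE[row["country"]],
    }


def from_billing(row):
    """Map a billing row onto the same target schema."""
    # Billing stores names as "LAST, FIRST" in upper case.
    last, first = [part.strip() for part in row["name"].split(",")]
    return {
        "customer_id": int(row["customer"]),  # stored as text in this silo
        "name": f"{first.title()} {last.title()}",
        "country": COUNTRY_TO_CODE[row["country_code"]],
    }


# Every source needs its own mapping function before the rows can be combined.
unified = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

Each additional source, and each change to the target schema, means another mapping function to write and maintain.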

All of this manipulation, cleaning, and transforming is necessary for two reasons. Relational systems aren’t very good at representing contextualized business meaning, and the relational model itself, as distinct from its dominant query language, SQL, is a leaky abstraction that does not lend itself well to integration, especially for connection-rich data. The relational data model was never intended to support complex business processes with changing requirements in a data landscape that is dominated by heterogeneity and diversity.

Older Integration Styles Are Falling Short

The data landscape is not only more heterogeneous than it used to be, but it has also expanded dramatically. When the relational model and SQL were being developed, semi-structured and unstructured data simply didn’t count. Emails, social network data, and IoT either had yet to be invented or weren’t part of the enterprise world. In other words, relational data management systems worked reasonably well when the enterprise data landscape was itself predominantly structured, but that is no longer the case. The enterprise data landscape is increasingly hybrid, varied, and changing.

In fact, the challenges to conventional data integration are proliferating. The emergence of IoT, the rise in unstructured data volume, increasing relevance of external data sources, and the headlong rush to hybrid, multi-cloud environments are all impediments to wide-scale data integration based on data location in storage with a relational data model. So, while the data landscape itself has changed, what about the enterprise’s requirements for data integration and analytics systems? Surely, if those requirements are relatively unchanged, then there must be some hope left for relational systems, right?

The truth is, these requirements have changed along with the data landscape itself, creating two relevant pressures. First, the impact of globalization and an ever-shrinking world has created unprecedented awareness of the connectedness of the human world and, of course, of the modern enterprise. Thanks in part to the pandemic, we realize more than ever that connected networks are everywhere and that data systems and data silos must be united as a result. As the name suggests, enterprise data is largely about the enterprise itself. The connected enterprise, conversely, deals in business meaning and context, which may be why connected enterprises seem to be winning everywhere we look. The second pressure has been created by the rise of AI, ML, and the various analytics systems. These are nothing more than incredibly intricate, powerful machines that run on data as their essential fuel. No data, no insight, and that’s true no matter how clever the algorithm, AI team, or learned model.

The Future of Enterprise Data Integration

To create business value within the enterprise, an organization must be able to connect all the data that matters. Yet because of changes in the data landscape, this information is spread across numerous formats, types, storage systems, applications, and computing environments. Worse, it can reside anywhere, whether internally or externally in a public or private cloud. Of course, some of this data exists as relational tables, but more and more of it will exist in some other form in the future, one that may not be amenable to representation in relational tables.

The real problem with conventional data integration systems based on the relational model comes down to representation. The new diversity of requirements and data landscape has finally burst the long-running illusion that you can just jam non-relational information/assets into relational storage and it will all be fine. The NoSQL movement of data storage systems has already reckoned with that reality, leading to another proliferation, this time in the area of database systems.

However, the data integration space is still completely dominated by relational-first and relational-only systems that are stuck with the fundamental idea of data location in the storage layer. Given the current representational problem, many companies now understand that only the semantic graph data model is able to represent data that is natively stored in any other structure and to connect all relevant metadata and context.

Semantic graphs create meaning by mapping entities, their metadata, and their relationships in an evolving information network. By applying a fundamentally different approach to data integration, organizations can substitute the idea of business meaning in the compute layer for data location in the storage layer. In this way, next-wave data integration systems will leverage semantic graphs and data virtualization technology to represent the connectedness of data in a way that unlocks business value by narrowing the gap between what data means and how it’s managed, queried, searched, analyzed, and connected.
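To make the idea of mapping entities, metadata, and relationships more concrete, here is a minimal sketch of a semantic graph stored as subject-predicate-object triples. It is an illustration only, not how any particular product works: the SemanticGraph class, the identifiers such as "company:acme", and the predicates such as "follows" are all hypothetical.

```python
class SemanticGraph:
    """A tiny in-memory triple store (illustrative, not production code)."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        """Record one fact as a (subject, predicate, object) triple."""
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the given pattern; None is a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


graph = SemanticGraph()

# Facts that might originate in a hypothetical content system.
graph.add("article:42", "mentions", "company:acme")
graph.add("article:42", "published_on", "2021-03-01")

# Facts that might originate in a hypothetical customer system.
graph.add("customer:7", "follows", "company:acme")

# The graph can evolve: a new kind of relationship needs no schema change.
graph.add("company:acme", "headquartered_in", "city:berlin")


def articles_for(graph, customer):
    """Find articles mentioning any entity the customer follows."""
    followed = {o for _, _, o in graph.match(subject=customer, predicate="follows")}
    return sorted(s for s, _, o in graph.match(predicate="mentions") if o in followed)


relevant = articles_for(graph, "customer:7")
```

The query joins facts from both hypothetical sources through the shared entity "company:acme" rather than through table locations, which is the substitution of meaning for storage location described above.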

Consider this actual use case as an example. Dow Jones is using a semantic graph as its data integration platform to connect multiple data silos and using advanced AI techniques to “reimagine the news.” With access to millions of facts derived from 50 years of digitized news media, Dow Jones is a great example of how the data landscape has changed. It also shows how an organization can hold a unique internal data universe that is naturally unconnected. The business environment floods Dow Jones’s customers with endless noise, but there is a signal within the noise, and Dow Jones has built a personalized news sense-making application that focuses on what customers need to know and when.


