Big Data: The Battle Over Persistence and the Race for Access Hill

Jan 7, 2014

By John O’Brien


Companies and vendors are beginning to accept that multiple database technologies must be interwoven to deliver the much-needed Modern Data Platform (MDP). Keep in mind, though, that the pendulum will continue to swing. It may be 5 or 10 years from now, but some things we know about technology hold true. Computing price-performance will continue to improve as it has under Moore’s Law, so we can combine higher numbers of CPU cores in parallel with lower-cost, more abundant memory, faster solid-state storage, and higher-capacity mechanical disk drives. Add the rate of innovation and maturation that is driving big data today, and we could see the capabilities of Hadoop derivatives, MongoDB, or some emerging data technologies eclipse the highly specialized and optimized data technologies being deployed today to meet demand.

There are great debates about disparate database ecosystems versus the all-in-one Hadoop; it’s simply a matter of timing and vision versus the reality of today’s demanding, data-centric environments.

The Ability to Manage Semantic Context of All Data

When you accept the premise of a federated data architecture based primarily on workloads rather than logical data subjects, the next question that arises is, “How do I find anything and where do I start?” The ability to manage the semantic context of all data, its usage for monitoring and compliance, or to provide users with a single or simple point of access is the Race for Access Hill.

When you think about “the internet,” you realize that it’s used as a singular noun, similar to how “Google” has become a verb meaning to search through the millions of servers that comprise the internet. Therefore, if the Modern Data Platform represents all the disparate data stores and information assets of the enterprise in a singular noun form, we need a point of access and navigation. Otherwise, the MDP is simply a bunch of databases.

One major concept at stake for modern data architects in the Race for Access Hill is how to centralize semantic context for consistency, collaboration, and navigation. Previously, in the organized world of data schemas, the many database vendors and technologies made data access heterogeneous, but it was still unified SQL data access under a single paradigm. Federated data architectures were still predominantly SQL schema in nature and easier to unify. Today’s key-value stores, such as Hadoop, can separate the context of data, or its schema, from the data itself. This has great discovery-oriented benefits: the schema is bound to the data late, at read time, rather than being analyzed and designed before the data is loaded, as in a traditional RDBMS.
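To make late binding concrete, here is a minimal Python sketch of the idea, not of any particular Hadoop component. The sample records, the two schemas, and the `read_with_schema` helper are all hypothetical names invented for illustration. The same raw records are stored once with no declared schema, and each consumer applies its own schema when it reads them.

```python
import json

# Raw records are stored as-is; no schema is declared at load time.
# (Hypothetical sample data for illustration only.)
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "channel": "web"}',
    '{"user": "raj", "amount": "5", "device": "ios"}',
]

# A "schema" here is simply a mapping of field name to converter,
# chosen by the consumer at read time rather than by a DBA at load time.
BILLING_SCHEMA = {"user": str, "amount": float}
MARKETING_SCHEMA = {"user": str, "channel": str}


def read_with_schema(raw_lines, schema):
    """Parse each stored record and project it onto the given schema.

    Fields absent from a record come back as None instead of failing,
    which is what lets differently shaped records share one store.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            row[field] = convert(value) if value is not None else None
        rows.append(row)
    return rows


# Two consumers bind two different schemas to the same stored data.
billing_rows = read_with_schema(RAW_EVENTS, BILLING_SCHEMA)
marketing_rows = read_with_schema(RAW_EVENTS, MARKETING_SCHEMA)
```

The trade-off is visible even at this scale: nothing is rejected at load time, so questions such as what to do with a missing `channel` are deferred to each reader rather than settled once in a shared schema.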

Centralizing context can be done in a Hadoop cluster’s HCatalog or Hive components for semantic integration with other SQL-oriented databases for federation, hence joining the SQL world where possible. (Reminds me of my favorite recent Twitter quote, “Who knew the future of NoSQL was SQL?”) Data virtualization (DV) can serve as a single access point for the broad, SQL-based consumer community, thereby becoming the “glue” of the Modern Data Platform that unifies persistence across many data store workloads. HCatalog and Hive, later additions to Hadoop, offer the same capability, but only for data that fits the SQL paradigm; MapReduce was designed to enable any analytic capability through a programming model. Other NoSQL data stores, such as graph databases, don’t inherently “speak SQL,” so to be comprehensive, an access layer (or point) needs to be service-oriented as well. Consumers will need a simple navigation map that allows them to access and consume information from data services as well as virtual data tables. The long-term strategy will lean increasingly toward service orientation; however, virtualized data will still be needed for information access situations.
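As a rough illustration of such a navigation map, the Python sketch below registers one virtual table and one data service behind a single `access` function. Everything in it is an assumption made for the example: an in-memory SQLite database stands in for any SQL-speaking store, a plain dictionary stands in for a graph database, and the `CATALOG`, `access`, and `followers_service` names are hypothetical rather than part of any DV product.

```python
import sqlite3

# Stand-in for any SQL-speaking store reachable through data virtualization.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Raj")])

# Stand-in for a graph store that does not speak SQL: who follows whom.
FOLLOWS = {"Ana": ["Raj"], "Raj": []}


def followers_service(name):
    """Hypothetical data service: return the people who follow `name`."""
    return sorted(person for person, followed in FOLLOWS.items() if name in followed)


# The navigation map: one place where consumers look up an asset,
# whether it is exposed as a virtual table or as a data service.
CATALOG = {
    "customers": {
        "kind": "virtual_table",
        "sql": "SELECT id, name FROM customers ORDER BY id",
    },
    "followers": {"kind": "data_service", "call": followers_service},
}


def access(asset, **params):
    """Single point of access: route the request by the asset's kind."""
    entry = CATALOG[asset]
    if entry["kind"] == "virtual_table":
        return conn.execute(entry["sql"]).fetchall()
    return entry["call"](**params)


customer_rows = access("customers")
raj_followers = access("followers", name="Raj")
```

The consumer calls `access` the same way in both cases and never needs to know which store answered, which is the property the access layer is meant to provide.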

Centralized Access to Big Data

The resolution for this portion of the Race for Access Hill will be gradual over the coming years; as the need arises, the technology and strategy are already in place for companies to adopt. However, this is not the case with the “hill” portion of “the race”: Vendors are racing to position their products to be that single point of access (the hill), with compelling arguments and case studies to support them. Aside from the SQL/services centralization of semantic context, the next question becomes, “Where should this access point live within the architecture?”

There are four different locations or layers where centralized access and context could be effectively managed. Think of a continuum between two points, with the data at one end and the consumer or user at the other. Along this continuum are several points where you could introduce centralized access and information context.
