Data Warehousing in the Era of Big Data


The big data pyramid illustrates the different layers of the lake and what each represents from a data consumption and governance perspective.

Landing Area: This layer is where source data is stored in its full fidelity. It reduces barriers to onboarding new data, providing early analytic access for new insights as well as the raw materials for “to-be” data products. Only very basic governance policies are required, in the form of metadata (very often a partitioning schema) and information lifecycle management (security and disposition); see the sketch following the layer descriptions.

Data Lake: Data may be graduated from the landing area to the data lake. This data has basic governance policies, including data quality, retention, and metadata. It often has standard views or projections, allowing users to interact via familiar means such as SQL, data exploration, and business intelligence tools (also shown in the sketch below).

Data Science Workspace: This layer is the foundry of new data products. The data science work carried out here may yield new products, including new EDW facts.

Big Data Warehouse: This layer is fully governed, providing accurate, reliable, and well-defined information to data consumers. This big data warehouse may be platformed alongside the broader data lake, or in combination with traditional relational or MPP database technology.
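
To ground the lower layers of the pyramid, here is a minimal PySpark sketch of the landing-to-lake flow; the S3 paths, column names, and date-based partitioning scheme are all assumptions made for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("landing-to-lake").getOrCreate()

# Landing area: store source data in full fidelity. The only governance
# is the partitioning scheme (ingest date), which doubles as a handle
# for lifecycle management (retention and disposition).
raw = spark.read.json("s3://corp-landing/crm/extract/")  # hypothetical drop zone
(raw.withColumn("ingest_date", F.current_date())
    .write.partitionBy("ingest_date")
    .mode("append")
    .parquet("s3://corp-landing/crm/events/"))

# Data lake: graduated data gains basic quality rules and a standard
# projection that analysts can query with familiar SQL/BI tooling.
spark.read.parquet("s3://corp-landing/crm/events/").createOrReplaceTempView("crm_raw")
spark.sql("""
    CREATE OR REPLACE TEMP VIEW crm_events AS
    SELECT customer_id, event_type, event_ts, ingest_date
    FROM crm_raw
    WHERE customer_id IS NOT NULL   -- basic data-quality rule
""")
spark.sql("SELECT event_type, COUNT(*) FROM crm_events GROUP BY event_type").show()
```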

NoSQL Extensions

As much as we love the humble relational database and the ubiquitous SQL language, they do not support all use cases. Often, scale is the issue: we simply have too much data to store, or it arrives too rapidly to be handled by conventional platforms. In other cases, the core data access patterns and processing requirements are not easily supported. Under these circumstances, we can look to NoSQL technology to extend our data warehouse.


Until recently, most analytic solutions, including the EDW, were back-office systems, there to support analytics for internal data consumers. This is fundamentally changing as these systems are increasingly leveraged for “front-end” consumption within client- and public-facing applications and bombarded with high volumes of data. Use cases include real-time processing of streaming data, recommendations, customer profile display or detailed time series (customer API), and competition leaderboards. All require analytic systems to provide fast data ingestion, low-latency retrieval, and data models that support our storage and access patterns.


These are perfect circumstances for NoSQL-backed applications to bolster an EDW. We can back our real-time processing and “customer API” with key-value or columnar databases such as Cassandra or Dynamo. We can platform our leaderboard on Redis, taking advantage of its sorted-set data structures.
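
For illustration, here is a minimal leaderboard sketch built on Redis sorted sets via redis-py (3.x argument order for zincrby); the key and member names are made up:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# A sorted set keeps members ordered by score: O(log N) updates and a
# single server-side range scan for the top-N read.
r.zincrby("leaderboard:weekly", 250, "player:alice")
r.zincrby("leaderboard:weekly", 175, "player:bob")

# Top 10 players, highest score first.
for rank, (member, score) in enumerate(
        r.zrevrange("leaderboard:weekly", 0, 9, withscores=True), start=1):
    print(rank, member.decode(), int(score))

# A single player's rank (0-based) without scanning the whole set.
print(r.zrevrank("leaderboard:weekly", "player:alice"))
```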

Graph database technologies offer additional data processing and analytic capabilities not provided through traditional means. Graphs are everywhere these days: social networks, telecommunications and networking, physical routing, generating recommendations, and general path-optimization problems. Graph databases were born to solve these sorts of problems.
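
As a hedged illustration, the sketch below runs a friends-of-friends recommendation in Cypher through the official neo4j Python driver; the Person/FRIEND graph model, URI, and credentials are all assumptions:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# Recommend people two hops away: friends of my friends whom I don't
# already know. Multi-hop traversals like this are the native operation
# of a graph database; no self-joins required.
QUERY = """
MATCH (me:Person {name: $name})-[:FRIEND]-()-[:FRIEND]-(candidate)
WHERE candidate <> me AND NOT (me)-[:FRIEND]-(candidate)
RETURN candidate.name AS suggestion, count(*) AS mutual_friends
ORDER BY mutual_friends DESC
LIMIT 5
"""

with driver.session() as session:
    for record in session.run(QUERY, name="Alice"):
        print(record["suggestion"], record["mutual_friends"])

driver.close()
```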

Search comes to the rescue in the world of semi-structured data and free text. The majority of new data being generated is unstructured. Search technologies such as Solr and Elasticsearch can help users wrangle, explore, and analyze data with the ease of a Google search bar.
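
To make this concrete, a minimal sketch using a recent elasticsearch Python client (8.x keyword arguments); the index name and documents are invented:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a semi-structured document as-is; Elasticsearch infers a
# dynamic mapping, so free text is searchable without up-front schema.
es.index(index="support-tickets", document={
    "subject": "Nightly ETL job failing",
    "body": "The warehouse load times out after the schema change...",
    "opened": "2016-05-01",
})

# Google-style full-text query across fields, ranked by relevance.
hits = es.search(index="support-tickets", query={
    "multi_match": {"query": "ETL timeout", "fields": ["subject", "body"]}
})
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["subject"])
```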

Apache Spark

Born at UC Berkeley as a research project, Spark has skyrocketed to become one of the most exciting and active open source projects around.

Flexibility is one reason for its success. Apache Spark is the Swiss army knife of data processing, providing SQL, streaming, machine learning, and graph processing. It also provides APIs for most major languages, making Python, and even R, first-class citizens in a world dominated by Java.

Spark provides several key enablers for our Polyglot Warehouse architecture. Foremost, it provides a high-performance framework for data analysis and data processing. Engineers and data scientists can use languages they are comfortable with, interact with high-level, intuitive APIs, and choose the processing models that best fit their use cases.

SQL is front and center in Apache Spark. The ubiquity of SQL is unquestionable, and it’s the language of choice for many technicians and business users. SQL enables the leveraging of existing tooling such as data exploration and business intelligence. Optimizations to Apache Spark’s SQL engine continue at a rapid pace and will likely achieve performance parity with leading database engines in the very near future.
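
As a small illustration of that SQL-first experience, the hedged sketch below applies plain SQL to semi-structured JSON with PySpark; the path and field names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-json").getOrCreate()

# Spark infers a schema from raw JSON, so plain SQL works immediately.
spark.read.json("hdfs:///data/clickstream/").createOrReplaceTempView("clicks")

spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clicks
    WHERE event_type = 'pageview'
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""").show()
```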

Spark Query Federation

Spark provides a unique capability when it comes to query federation. SQL queries can be defined to span semi-structured data in Hadoop, tables in a relational or MPP database, and even NoSQL sources. Spark can therefore serve as the “glue” for ad hoc analysis and exploration in our Polyglot Warehouse.
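
Here is a hedged sketch of what that glue looks like in PySpark, joining a JSON dataset in Hadoop with a relational table read over JDBC; connection details, paths, and table names are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federation").getOrCreate()

# Semi-structured events sitting in the lake.
spark.read.json("hdfs:///lake/web_events/").createOrReplaceTempView("events")

# A dimension table living in a relational database, read over JDBC.
customers = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dw-host:5432/edw")  # hypothetical EDW
    .option("dbtable", "dim_customer")
    .option("user", "analyst").option("password", "...")
    .load())
customers.createOrReplaceTempView("customers")

# One SQL statement spanning both systems; Spark is the glue.
spark.sql("""
    SELECT c.segment, COUNT(*) AS events
    FROM events e
    JOIN customers c ON e.customer_id = c.customer_id
    GROUP BY c.segment
""").show()
```

The join itself executes in Spark, so the relational source only has to ship the dimension rows, not participate in the cross-system query plan.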

Query federation does not discourage the data consolidation efforts promoted by the data lake and EDW. Instead, it supports early insights and exploration, blending together sources of data in ways that wouldn't otherwise be possible.

What’s Ahead

The data warehouse is finally evolving, breaking from its rigid confines. This change is well overdue, and it has been accelerated by the big data era. More important than the new technologies and techniques ushered in by this era is the change in mindset. There is no longer just one tool for the job, and we need to be agile and pragmatic when developing solutions for different problems.

We cannot, however, forget the fundamental principles of traditional EDW. This foundation of governance and integration, coupled with innovation and technologies, will build the analytic systems of tomorrow.
