
Data Warehousing in the Era of Big Data


For decades, the enterprise data warehouse (EDW) has been the aspirational analytic system for just about every organization. It has taken many forms throughout the enterprise, but all share the same core concepts: integrating and consolidating data from disparate sources, governing that data to provide reliability and trust, and enabling reporting and analytics. A successful EDW implementation can drastically reduce IT staff bottlenecks and resource requirements, while empowering and streamlining data access for both technical and nontechnical users.

The last few years, however, have been very disruptive to the data management landscape. What we refer to as the “big data” era has introduced new technologies and techniques that provide alternatives to the traditional EDW approach and, in many cases, exceed its capabilities. Many claim we are now in a post-EDW era and that the concept itself is legacy. We position the EDW as a sound concept, but one that needs to evolve.

Challenges With the Traditional EDW

The EDW implementation itself can be a fairly difficult task with a high risk of failure. Generally accepted survey data puts the failure rate somewhere around 70%. And of the 30% deemed nonfailures, a great number never achieve ROI or successful user acceptance. To a great extent this has been caused by legacy interpretations of EDW design and traditional waterfall SDLC. It’s safe to say more modern, agile techniques for design and implementation prove more successful and offer a higher ROI. These techniques allow EDW implementations to grow organically and be malleable as the underlying data and business requirements change.

The fundamental issue is that the traditional EDW does not solve all problems. In many organizations, the EDW has been seen as the only solution for all data analytics problems. Data consumers have been conditioned to believe that if they want analytics support for their problems, their only choice is to integrate data and business processes into the EDW program. At times, this has been a “cart before the horse” situation, in which enormous effort is put into modeling new use cases into a rigid, governed system before the true requirements and value of the data are known.

In other cases, the underlying design and technology of the EDW does not fit the problem. Semi-structured data analysis, real-time streaming analytics, network analysis, search, and discovery are ill-served by the traditional EDW backed by relational database technology.


Use cases such as these have become more common in the era of big data. In the “old days,” most data came from rigid, premises-based systems backed by relational database technology. Although these systems still exist, many have moved to the cloud as SaaS offerings. In addition, many no longer run on relational platforms, and our method of interaction with them is often via APIs with JSON and XML responses. There are also new data sources, such as social, sensor and machine data, logs, and even video and audio. Not only do they produce data at overwhelming rates and with an inherent mismatch to the relational model, but there is often no internal ownership of the data, making it difficult to govern and conform to a rigid structure.
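To make that mismatch concrete, the short Python sketch below uses a hypothetical SaaS API response (the field names are invented for illustration) to show how nested, variably shaped JSON resists flattening into a fixed relational row:

```python
# A minimal sketch, based on a hypothetical SaaS API response, of why nested,
# variably shaped JSON is an awkward fit for a fixed relational schema.
import json

api_response = json.loads("""
{
  "customer": {"id": 42, "name": "Acme Corp"},
  "events": [
    {"type": "login",  "device": {"os": "iOS"}},
    {"type": "search", "terms": ["data warehouse", "polyglot"]}
  ]
}
""")

# Flattening into rows forces decisions about repeating groups and keys that
# only appear for certain event types, before the data's value is even known.
for event in api_response["events"]:
    row = {
        "customer_id": api_response["customer"]["id"],
        "event_type": event["type"],
        "device_os": event.get("device", {}).get("os"),     # often null
        "search_terms": ", ".join(event.get("terms", [])),  # often empty
    }
    print(row)
```

Each event type carries different keys, so a rigid schema either multiplies nullable columns or discards detail, which is exactly the friction these sources create for the traditional EDW.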

The Big Data Revolution

In response, there has been an amazing disruption in the tools and techniques used to store and process data. This innovation was born in large tech companies such as Twitter and Facebook and continues to evolve rapidly as organizations of all kinds encounter similar challenges with their own data. Today, the excitement of the big data era is not just about having lots of data. What’s truly interesting is that organizations of every size now approach data problems in different, tailored ways. It’s no longer a one-size-fits-all shoehorn into traditional systems. Organizations now objectively design and build systems based on business and data requirements, not on preconceived design approaches. This is, no doubt, enabled by more options in the technology landscape, but the change in mindset is the real game changer.


Engineer and author Martin Fowler captured this movement best with the term “Polyglot Persistence”: “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we’ll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” In other words, it is natural for an organization to adopt a variety of new storage and data processing technologies based on its requirements. The concept is an extension of “Polyglot Programming” and “Microservice Architecture,” in which languages and platforms are chosen based on their ability to tackle different types of problems.
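As a purely illustrative sketch (the workload names and store pairings are hypothetical, not a prescription), polyglot persistence boils down to asking how each kind of data will be manipulated and then mapping it to a fit-for-purpose store, rather than defaulting everything into one platform:

```python
# An illustrative mapping of analytic workloads to fit-for-purpose stores.
# The workload names and technology choices are hypothetical examples only.
FIT_FOR_PURPOSE_STORES = {
    "financial_reporting": "relational EDW (star schema)",
    "user_sessions":       "key-value store",
    "product_catalog":     "document store",
    "social_graph":        "graph database",
    "clickstream_events":  "distributed file system + SQL query engine",
}

def store_for(workload):
    """Ask how the data will be manipulated first, then pick the technology."""
    return FIT_FOR_PURPOSE_STORES.get(workload, "relational EDW (star schema)")

for workload in ("financial_reporting", "social_graph", "clickstream_events"):
    print(workload, "->", store_for(workload))
```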

So Is the Data Warehouse Relevant?

Let’s return to the question, “Is the data warehouse still relevant in this new era?” To answer, let’s explore what has gone wrong in big data.

Once mainstreamed, big data tools such as Hadoop were picked up by various organizations to solve challenging data problems. Most started out as proofs of concept and were then often launched in a production-like capacity. Many were built completely in silos, without regard for enterprise architecture. Data quality and data stewardship were considered bad words, and the concepts of the “old way” were almost completely ignored in design and implementation. In the end, many of these systems suffered from service-level issues and interoperability challenges with other systems, as well as a general lack of trust from data consumers.

This is where the concepts behind data warehousing become critical as they apply to big data systems: Analytic systems still need data governance; concepts of data quality and data stewardship are absolutely critical; and conformed master data and interoperability between applications matter.

We are firm in our belief that the humble, traditional EDW still has a place in modern data architecture. All the tech giants that promoted this new era still have data warehouses. When it comes to serving standard reporting and business KPIs and providing mission-critical numbers to the organization and its investors, a well-designed EDW is tough to beat.

The Polyglot Warehouse

In order for the EDW to survive, it must evolve. We must acknowledge that a rigid, highly governed, traditional EDW is not the solution for all problems, and it now needs to integrate and coexist with other analytic solutions. This Polyglot Warehouse will consist of a set of fit-for-purpose analytic technologies, with the EDW very likely being a critical component.

Let’s discuss a few key emerging technologies and architectures supporting this evolution:

Data Lake

The term “data lake” has many definitions throughout the industry, ranging from a dumping ground for “to-be-used” data, to a more or less traditional EDW approach implemented on a big data platform.

We would best define the data lake as an analytic system that allows data consolidation and analytic access with tunable governance. The data lake consists of a distributed, scalable file system, such as HDFS (Hadoop Distributed File System) or Amazon S3, coupled with one or more fit-for-purpose query and processing engines such as Apache Spark, Drill, Impala, and Presto.
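As a minimal sketch of that pattern, assuming a PySpark environment and a hypothetical object storage path, schema-on-read access to a data lake might look like this:

```python
# A minimal sketch of schema-on-read querying against a data lake with PySpark.
# The s3a:// bucket and path below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

# Read raw, semi-structured JSON directly from object storage; Spark infers
# the schema at read time instead of requiring an up-front relational model.
events = spark.read.json("s3a://example-lake/raw/clickstream/")

# Expose the files to SQL-style analytics without a separate ETL load step.
events.createOrReplaceTempView("clickstream")
spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM clickstream
    GROUP BY event_type
    ORDER BY event_count DESC
""").show()
```

Governance is “tunable” here in the sense that the same files can be left raw for exploration or progressively curated, documented, and schema-enforced as their value becomes clear.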
