
The Data Warehousing Sanity Check


As agile methods take hold, scrum teams will become more common in the data space. Collaboration among a newly formed chief data office, the business product owner, IT architects, and data scientists and/or analysts is required to formulate and implement data solutions quickly enough for the business to realize their value. Data initiatives that lack members with experience and expertise in the various required domains are doomed to failure.

Technologies that enable collaboration are already being embraced by startups, and within the next year, I suspect more and more organizations will adopt them. Apache Spark is evidence that the trend toward data analytics collaboration has already come to fruition. By offering a unified framework for data engineering and data science, Spark has quickly become the default system for teams that demand a single interface for data discovery, exploration, ETL, and machine learning.
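
To make that single-interface claim concrete, here is a minimal PySpark sketch in which an ETL step and a machine learning step share one session and one API. The file paths, column names, and model choice are hypothetical illustrations, not details from any vendor's platform.

    # Minimal PySpark sketch: ETL and machine learning in one interface.
    # Paths, columns, and the model are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("unified-etl-ml").getOrCreate()

    # ETL: ingest raw CSV, clean it, and persist it in a columnar format.
    raw = spark.read.csv("raw/orders.csv", header=True, inferSchema=True)
    clean = (raw
             .dropna(subset=["quantity", "unit_price"])
             .withColumn("revenue", F.col("quantity") * F.col("unit_price")))
    clean.write.mode("overwrite").parquet("curated/orders")

    # Machine learning: fit a simple model on the curated data without
    # leaving the same Spark session.
    features = VectorAssembler(inputCols=["quantity", "unit_price"],
                               outputCol="features").transform(clean)
    model = LinearRegression(featuresCol="features",
                             labelCol="revenue").fit(features)
    print(model.coefficients)

The point is not the model itself but that the cleansing and the model fit happen against the same data, in the same session, through the same interface.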

Vendor selection for big data technologies is no longer the land of startups, as Spark is now embraced by some of the largest technology companies in the world. IBM, for example, recently announced the launch of DataWorks, a Spark-oriented platform designed to cater to five distinct personas: data engineer, data scientist, business analyst, app developer, and chief data officer. Collaboration runs on Jupyter-based notebook technology extended by IBM, and all five personas can share work within the same interface. Notebooks contain data, code, documentation, annotation, and visualization. I predict notebooks will soon become the new BI toolkit.

The Corporate Data Pyramid (CDP)

As previously noted, a rigid, highly governed, traditional EDW is not the solution for all problems. It now needs to integrate and coexist with other analytic applications. The EDW will remain a critical component, but it will not be the only significant component.


Dimensional data models still prevail in the world of relational databases, but the data warehouse legacy of doing all things data within a relational database is gone. The new trend is to use a different, more specialized tool for each piece of the job. ETL, for example, no longer needs to happen within the RDBMS; much of the data preparation work for the data warehouse environment is now done outside the relational database. In fact, the relational database is now largely limited to data presentation. This marks a paradigm shift toward the concept of the “corporate data pyramid.” The CDP comprises four tiers: the data landing, data lake, data workbench, and data presentation.

The Data Landing

The data landing area is a quarantined area of the data lake that stores all data in full fidelity as it is received from its source. Data is stored and partitioned to simplify data management, typically by source system and receipt date. Data can be archived after a predetermined retention period as needed. The landing area carries very little data governance, as it is essentially locked down from everything except the managed processes that make lightly governed datasets available within the data lake.
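
A minimal sketch of what a managed landing-area load might look like in Spark, assuming the partitioning scheme described above (source system and receipt date); the paths and column names are hypothetical.

    # Hypothetical landing-area load: data kept as received and
    # partitioned by source system and receipt date, so retention
    # and archival become directory-level operations.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("landing-load").getOrCreate()

    raw = spark.read.json("incoming/crm_events.json")  # as-received feed
    (raw
     .withColumn("source_system", F.lit("crm"))
     .withColumn("received_date", F.current_date())
     .write
     .partitionBy("source_system", "received_date")
     .mode("append")
     .parquet("landing/events"))

Once a partition's retention period expires, archiving or dropping it is a simple operation on that partition alone, without touching the rest of the landing area.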

The Data Lake

The data lake encompasses the lower tiers of the pyramid. Many companies with some experience in this area are discovering that deeper data analytics and data science are often better done outside a rigidly structured environment. In fact, to save time and money, businesses are now requiring that this analysis be done before the rigorous process of structuring, formatting, and fully governing the data. The data lake includes feeds of raw events from the landing area, as well as enrichment and reference data that may come from an existing data warehouse or a master data management system. All data is tagged with metadata and can be searched via a data catalog.
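
As a hedged illustration of that tagging and catalog step, the sketch below registers an enriched dataset as a catalog table and attaches searchable metadata via table properties; the database, table, and property names are placeholders, not a prescribed convention.

    # Hypothetical sketch: register an enriched dataset in the catalog
    # and tag it with metadata that a data catalog can surface.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("lake-catalog")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("CREATE DATABASE IF NOT EXISTS lake")

    # Register the enriched dataset as a managed table.
    enriched = spark.read.parquet("lake/enriched/events")
    enriched.write.mode("overwrite").saveAsTable("lake.events")

    # Attach searchable metadata as table properties.
    spark.sql("""
        ALTER TABLE lake.events SET TBLPROPERTIES (
            'source_feed' = 'landing/events',
            'steward'     = 'data-engineering',
            'comment'     = 'Raw CRM events enriched with reference data'
        )
    """)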

The data lake is a great tool for data discovery, exploration, deep analytics, and data science. Typically, data in the lake is accessed and manipulated via notebooks, using languages such as Python, Scala, SQL, or R within a Spark framework.
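
That notebook workflow might look like the following hypothetical cells, which discover the datasets registered in the catalog and then explore one of them with a mix of DataFrame and SQL styles; the table and column names are placeholders carried over from the sketches above.

    # Hypothetical notebook cells: discover and explore lake data.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("lake-exploration")
             .enableHiveSupport()
             .getOrCreate())

    # Discovery: list the datasets registered in the catalog.
    for table in spark.catalog.listTables("lake"):
        print(table.name, table.description)

    # Exploration: DataFrame and SQL views of the same data.
    events = spark.table("lake.events")
    events.printSchema()

    spark.sql("""
        SELECT source_system, COUNT(*) AS event_count
        FROM lake.events
        GROUP BY source_system
        ORDER BY event_count DESC
    """).show()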
