Page 4 of 4

The Data Warehousing Sanity Check

The Data Workbench

The data workbench is an area of ephemeral workspaces governed by information lifecycle management (ILM) rules that dispose of any stored data (and the underlying infrastructure, if in the cloud) after a predetermined time has passed. The workbench is used primarily by data analysts and data scientists doing data discovery or building predictive models whose resulting data does not need to be stored in perpetuity. The automated ILM process keeps the workbench area clean and prevents it from spawning non-production-supported shadow IT systems. Any data created in the workbench that is deemed production worthy and made available to other users or applications must pass a rigorous data governance and promotion process before being moved to the data lake or data warehouse.
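The TTL-based disposal that ILM rules apply to workbench workspaces can be sketched in a few lines. This is a minimal illustration, not any product's API; the workspace names, the 30-day TTL, and the `sweep_expired` helper are all assumptions made for the example.

```python
import time

# Dispose of workbench workspaces after 30 days (illustrative TTL)
WORKBENCH_TTL_SECONDS = 30 * 24 * 3600

# Hypothetical in-memory registry of workspaces and their creation times
workspaces = {
    "churn-model-sandbox": {"created_at": time.time() - 40 * 24 * 3600},
    "q3-discovery":        {"created_at": time.time() - 5 * 24 * 3600},
}

def sweep_expired(workspaces, ttl=WORKBENCH_TTL_SECONDS, now=None):
    """Return the surviving workspaces; expired ones would be deleted,
    along with their data (and, in the cloud, their infrastructure)."""
    now = time.time() if now is None else now
    return {name: ws for name, ws in workspaces.items()
            if now - ws["created_at"] < ttl}

surviving = sweep_expired(workspaces)
# "churn-model-sandbox" (40 days old) is swept; "q3-discovery" survives
```

A real implementation would run such a sweep on a schedule and tear down cloud resources rather than dictionary entries, but the expiration logic is the same.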


The Data Presentation

The presentation layer is a fully structured, fully governed environment whose data is prepared, formatted, and stored specifically to support arbitrary user queries. This is where the average business user accesses data to create and run reports, dashboards, and visualizations. The presentation layer is the "top" tier of the data pyramid, reflecting the expectation that much of the data in the data lake never makes its way up to the fully structured presentation tier. The presentation database can be your existing data warehouse built on traditional RDBMS and MPP technologies. However, as SQL-compliant technologies within the big data ecosystem mature, I predict they will become the new standard for presenting data.
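As a stand-in for that fully structured tier, the sketch below uses Python's built-in `sqlite3` to show what "prepared and stored to support arbitrary user queries" means in practice: the data lands in a governed, typed table, and an ad hoc SQL query runs against it. The `sales` table and its columns are illustrative assumptions, not part of the article.

```python
import sqlite3

# A minimal, fully structured store: schema defined up front,
# data loaded into it, queries run against the known structure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# An arbitrary user query against the presentation tier
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The same query shape would run unchanged against an MPP warehouse or a SQL-on-Hadoop engine, which is the point of the SQL-compliant presentation tier: the consuming tools don't care which engine sits underneath.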

The Data Glue

Extract, transform, and load (ETL) is still required within a modern engineered data ecosystem. The acceptance of Spark by large companies such as IBM is strong evidence that Spark may become the new standard for data transformation and preparation for analytics. Although not required, the ability to use notebooks as a common interface may make Spark the preferred choice in industries where people with varying levels of technical understanding need access to the data. New-generation ETL tools are now being built from the ground up with Spark as their core operating platform. Look for more growth in this area in the coming years.
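The extract → transform → load pattern itself is engine-agnostic, so it can be sketched in plain Python; in a Spark-based pipeline each stage would operate on distributed DataFrames instead of lists. The CSV payload, field names, and the decision to drop malformed records are illustrative assumptions for the example.

```python
import csv
import io

# Toy source data with one malformed record (illustrative)
RAW_CSV = "id,amount\n1,10\n2,20\n3,bad\n"

def extract(text):
    """Pull raw records from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Type-cast fields and drop records that fail data preparation."""
    out = []
    for row in rows:
        try:
            out.append({"id": int(row["id"]), "amount": float(row["amount"])})
        except ValueError:
            continue  # malformed record: excluded from the clean set
    return out

def load(rows, target):
    """Write the prepared records into the target store."""
    target.extend(rows)
    return target

warehouse = load(transform(extract(RAW_CSV)), [])
```

Each stage has a single responsibility, which is what lets new-generation ETL tools swap the underlying engine (Spark, a warehouse's own SQL, etc.) without changing the pipeline's shape.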

What’s Ahead

Spurred by the continuing big data era, the data warehouse is making greater strides toward breaking free of its rigid confines. As we noted last year, a change in mindset is even more essential to catalyzing this change. In the next year, larger, traditional companies will come to accept that we are experiencing evolution at breakneck speed, and we will need to be agile and collaborative while developing solutions to different data problems. Moving data solutions to the cloud is also a trend that is picking up speed and will see hockey-stick growth in the coming years.

Even as the world of data is turned upside down, from the paradigm of “structure → load → analyze” to “load → analyze → structure,” the fundamental principles of traditional data warehousing and ETL will not be forgotten. The fundamentals of data governance and integration remain crucial, even in the world of the cloud, big data, and modern data engineering.


