Virtual Data for the High-Performance Warehouse

Nov 12, 2014

By Steve Karam

When it comes to tuning large, complex data warehouse environments there seem to be a plethora of options available. However, despite all the tuning tools and software features, consultants and training material, fully optimizing a warehouse environment is still an extremely long haul littered with caveats and roadblocks. There are multiple processes to analyze, several teams involved, and generally a huge amount of data flowing in and through the warehouse at all times. All of this activity will by necessity be focused on production source databases as well as production data warehouses.

Business intelligence, analytics, and just-in-time reporting have exploded in recent years. There was once a time when a DBA could safely perform all necessary maintenance on a data warehouse at many points throughout the week. Documentation manuals for older database software are littered with tips on shutting down access to data warehouses for the weekend while data loads are performed, dropping and recreating indexes along the way for a performance boost during the load. Today's high-velocity world does not allow for that; instead, the data supply chain must constantly be flowing from source to ODS to warehouse to the rich dashboards of eager executives and business analysts at all times. Rebuilding data models and testing new data load strategies are all but extinct options when attempting to optimize a warehouse ecosystem.

Multiple Copies of the Data Warehouse

Because of these barriers, there is really only one solution to ensure success in the complex warehouse ecosystem: create multiple copies of the warehouse. With multiple copies, data architects and developers can test new and improved load scripts. DBAs can work with development teams to refactor the architecture and tune database processes and queries. QA engineers can test changes thoroughly and even destructively to ensure a foolproof environment. And analysts, end users, executives, and reporting teams can get the data they need when they need it without impacting other users or processes.

Such a thing was once impossible due to the time and storage constraints surrounding database cloning. However, thanks to database virtualization technology it is now a reality. Database virtualization provides a mechanism that is not only capable of provisioning full copies of databases in minutes without additional storage usage, but it is also capable of orchestrating the entire process so that cloning can be done with the click of a button, no matter what the size of the source data.

Using this technology, it is perfectly feasible for each team with a stake in data warehouse performance tuning to have its own copy of:

The production database or ODS
The data warehouse itself
Online data marts

Benefits of Database Virtualization

ETL/Load Testing

Loading data into a warehouse is one of the most time-consuming and critical processes involved in BI. Due to the huge volume of data in modern enterprises and the requirement that data warehouses remain constantly available, development teams and data architects are faced with a difficult task when deciding upon a way to get data in when it is needed. These processes usually involve complex job chains with a variety of tasks and subtasks, and the failure of any one of them can result in a failure of the entire batch process. Additionally, it is easy for performance to suffer from poorly tuned data loads due to the high volume of data writes and countless queries that have to occur against large warehouse tables.

By provisioning full copies of source and target databases to batch job developers, it is possible to fine-tune these processes without impacting end users. Development teams can test new means of data loading without fear of breaking the database or degrading performance for others. And because the databases they are working against are full, virtualized clones of the source databases, they are able to test their loads in full to ensure they will run to completion before running them against the real warehouse. If there are issues, the virtual database can be rewound and the tests can run through again.

Offloaded Reporting

Even with the huge strain it puts on databases, networks, and storage, most companies are still running reports against production databases. With the boom of modern BI with myriad dashboards, analytics, and data gathering tools the situation becomes even more troublesome for performance, with dozens or hundreds of users and processes vying for access to data at all times of the day. The outcome is contention, either in performance on the source databases or in the delicate scheduling for job windows on a data warehouse.

Data warehouses are now only 24 hours a day, 7 days a week, and data must be consumable in an instant. Missed reporting equals missed analytics and missed opportunity. Despite the performance implications of running reporting against production databases or warehouses no matter what their current state, it is being done every day.

Virtual databases provide a huge boost here, with the ability to spin up, at the click of a button, in minutes from any point in time without additional storage usage. They can similarly be retired in minutes, allowing reporting to take place in any ad hoc manner without wasting resources. Full copies of source data, both from reporting and from the warehouse, can be provisioned and refreshed continuously with no performance impact during their creation or use. And because a virtual database can be provisioned from a past point in time, reporting can run at any time against pivotal intervals, such as the end of a quarter or fiscal year.

Performance Tuning Tests

It can be very difficult to test new indexes or performance enhancements on a running data warehouse. Because of this, developers and DBAs will usually test enhancements against out of data or incomplete development copies of the database, or worse yet not test at all and just hope for the best. Both scenarios present a world of potential issues and ultimately make it extremely hard for the original performance concerns to ever be addressed.

In this case, the solution is simple: clone the data warehouse in its entirety and run queries without worrying about impacting the data warehouse. Index testing, parameter changes, and even upgrades of the database software can be tested without any impact to the source environments. An advanced database virtualization platform can even create virtual databases off of other virtual databases, allowing teams to provision a copy of the production data warehouse, perform changes, and provision multiple copies of the changed development warehouse out for further development and QA. When these databases are no longer useful, they can be decommissioned or refreshed from the source with fresh data to continue testing against an updated set of data.

Parallelized Workloads

While most data warehouse environments are generally built to be extremely powerful and capable of performing large parallel workloads, sometimes that power just isn’t enough. Despite our best efforts in managing scheduling windows and building up parallel capabilities in our loading processes, there are still only 24 hours in a day.

However, because multiple virtual databases can be refreshed either off different points in time or continuously from the current point in time, it is possible to perform parallel workloads over multiple systems. This means that instead of trying to fit a dozen loading or transformational procedures into a single 24-hour window on a single production warehouse database, virtual databases can be provisioned to perform chunks of work and then import the data into the main data warehouse upon completion. In a CPU bound or time bound environment, this extra processing power can mean the difference between running jobs all day and running jobs in the span of a few hours.

Ensuring a Successful Data Warehouse

Although tuning a data warehouse and its associated ecosystem is still a considerable process, the best way to ensure success is to ensure that tuning, testing, and refactoring can take place at all. By introducing the ability for every team or developer to have an individual fully-functional copy of any source database, no matter what the size, in minutes, a foundation for a successful tuning strategy is within reach - right at the click of a button.

About the author:

Steve Karam is an Oracle Certified Master (OCM), Oracle ACE, and Cloudera Certified Developer for Hadoop (CCDH) with over 15 years of DBA, management, instructor, and development experience with hundreds of companies in a wide variety of vertical markets.