As businesses push to reduce the data latency between analytical and operational systems, data warehouses begin to take on more of the character of transactional systems. For a data warehouse to deliver near-real-time information, the choices generally are to update the warehouse more frequently or to access data directly from operational systems. Either way, the push to reduce latency changes the nature of database performance for the data warehouse. The main reason we created data warehouses in the first place was to separate resource-intensive analytical processing from short-duration but very frequent transaction processing. As the two worlds now come back together, the pressure on the database system can be significant.
But even if the same data is accessed for both transactional and analytical applications, real-time goals will force different performance metrics onto data warehouse databases. Database metrics evaluate transactions based on speed: that is, their ability to “get in,” “do their work,” and “get back out again.” Data warehousing queries cannot be measured the same way. The analytical nature of business intelligence means that data warehousing queries perform large and sometimes complex operations to arrive at heretofore unknown correlations in the data. Speed is not as important as accuracy. A better metric for such queries is repeatability; that is, does the same query against the same data at the same time of day exhibit similar performance characteristics? Managing performance in this way requires a different mindset. It necessitates the ability to capture and analyze trending patterns, perhaps by performing the same type of analytical queries on performance metrics as are being performed on the data in the data warehouse itself.
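To make the repeatability metric concrete, here is a minimal sketch of the idea: collect elapsed times for the same query run at the same time of day, then flag the query as repeatable if its run-to-run variation stays within a tolerance. The timing data, the coefficient-of-variation measure, and the 10% threshold are all illustrative assumptions, not part of any specific monitoring product.

```python
from statistics import mean, stdev

# Hypothetical elapsed times (in seconds) for the same warehouse query,
# captured at the same time of day over two weeks.
elapsed = [412.0, 405.5, 398.2, 420.1, 409.9, 415.3, 402.7,
           407.4, 411.8, 399.6, 417.2, 404.1, 413.5, 408.8]

def repeatability(times, cv_threshold=0.10):
    """Call a query repeatable if its coefficient of variation
    (stdev / mean) stays under the chosen threshold."""
    cv = stdev(times) / mean(times)
    return cv, cv <= cv_threshold

cv, is_repeatable = repeatability(elapsed)
print(f"coefficient of variation: {cv:.3f}, repeatable: {is_repeatable}")
```

Trending the same statistic week over week would reveal gradual degradation long before any single run looks like an outlier.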
Another challenge is that many data warehouse queries need to access aggregated data. With a separate data warehouse environment, database administrators can denormalize physical database structures to create aggregate tables, which contain redundant data summarized from other data. The purpose of the aggregate tables is to optimize performance and increase data availability — both noble goals. However, these tables add to the size of the data warehouse and the complexity of managing the environment.
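A denormalized aggregate table of the kind described above can be sketched as follows, using an in-memory SQLite database as a stand-in for the warehouse DBMS. The table and column names are hypothetical; the point is that the summary table holds redundant data precomputed from the detail table.

```python
import sqlite3

# In-memory database standing in for the warehouse environment.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('EAST', 100.0), ('EAST', 250.0),
        ('WEST', 175.0), ('WEST', 25.0);

    -- Denormalized aggregate table: redundant data summarized from
    -- the detail table so regional queries avoid scanning every row.
    CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total, COUNT(*) AS n
        FROM sales GROUP BY region;
""")
rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
print(rows)  # -> [('EAST', 350.0), ('WEST', 200.0)]
```

The cost, as noted, is extra storage plus the administrative burden of keeping `sales_by_region` in sync whenever `sales` changes.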
The major relational database systems support aggregation in an easier-to-use, performance-oriented manner. DBAs can create materialized query tables (MQTs), also known as automatic summary tables (ASTs), which essentially turn a query into a physical table. Unlike traditional aggregate tables, MQTs are automated and optimizer-aware: when the DBA creates an MQT, queries accessing the base tables do not need to be rewritten. Instead, the database optimizer understands the relationship between the MQT and the “real” tables accessed to create it. The DBMS itself decides when and how to use the MQT versus the base tables. Automated refresh and management commands simplify MQT implementation. But the more modifications made to the data in the warehouse, the less usable MQTs become, because of the time required to rebuild the physical data structure used to store the materialized data. The bottom line is that MQTs can bolster the DBA’s ability to deliver a real-time data warehouse because they offer better performance than normalized database implementations. But at the same time, the DBA must be aware of the limitations caused by latency when data changes.
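The staleness tradeoff described above can be illustrated with a small simulation. SQLite has no MQT feature, so this sketch materializes a summary by hand and refreshes it explicitly; the refresh step stands in for the DBMS-managed rebuild, and its cost is precisely what limits MQTs under heavy data churn. All names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('EAST', 100.0), ('WEST', 200.0);
    CREATE TABLE sales_mqt (region TEXT, total REAL);
""")

def refresh_mqt(conn):
    # Full refresh: rebuild the materialized result from the base table.
    # The more often base data changes, the more often this must run.
    conn.executescript("""
        DELETE FROM sales_mqt;
        INSERT INTO sales_mqt
            SELECT region, SUM(amount) FROM sales GROUP BY region;
    """)

refresh_mqt(conn)
conn.execute("INSERT INTO sales VALUES ('EAST', 50.0)")  # base table changes
stale = conn.execute(
    "SELECT total FROM sales_mqt WHERE region = 'EAST'").fetchone()[0]
refresh_mqt(conn)
fresh = conn.execute(
    "SELECT total FROM sales_mqt WHERE region = 'EAST'").fetchone()[0]
print(stale, fresh)  # stale copy still shows 100.0; after refresh, 150.0
```

Between the insert and the refresh, any query routed to the materialized copy sees data that lags the base tables, which is exactly the latency the DBA must weigh against the performance gain.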
As data warehouses evolve to meet real-time demands, how we manage database performance must also evolve, expand, and become more agile. Your organization will need administrators who understand not only the basics of database performance, but also the implications of real-time data warehousing for performance management. By taking advantage of the autonomic features of the DBMS and systems management software, you can simplify and streamline the performance of your data warehousing systems.