Giving the Data Center the "Moneyball" Treatment
Applying automated analytics to a wide variety of IT and business metrics is rapidly becoming the “next big thing.” When evaluating analytics solutions that add value on top of IT and business metrics, two main architectural considerations apply: the general database/data mart approach (centralized, distributed, and/or federated), and the technology used to integrate disparate data sources operationally (structured SQL-type database queries, custom APIs, and/or web service APIs).
Database/Data Mart Considerations
Centralized database approaches and considerations
A single central data repository has been used for decades as a basis for building analytics and reporting solutions. The centralized database approach usually appears in one of two variants.
Traditional data mart
The traditional approach, typically used for business intelligence analytic use cases, involves the creation of a highly customized schema and/or an OLAP data mart architecture. The custom schema lends itself to the rapid execution of highly complex queries, but those queries are predefined rather than ad hoc, and hence not particularly flexible or easily modified. Writing data and metrics into these repositories is typically processing-intensive and does not lend itself to the rapid population of vast numbers and types of data sources and metrics. The benefit of this approach is that a highly customized schema ensures that highly complex analytics can be delivered in a timely fashion. This is offset by the requirement that the analytic user either be, or work closely with, a SQL or relational database expert, especially for new or previously unanticipated analyses and reports.
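To make the trade-off concrete, here is a minimal sketch of a traditional data mart using Python's built-in sqlite3 module. The table names (dim_server, fact_cpu), the columns, the index, and the sample rows are all hypothetical and chosen only for illustration; a real mart would be far larger and would normally live in a dedicated relational or OLAP engine. What the sketch tries to show is that the schema and the report query are designed together, in advance.

```python
import sqlite3

# Hypothetical star schema: one dimension table and one fact table.
SCHEMA = """
CREATE TABLE dim_server (
    server_id     INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    business_unit TEXT NOT NULL
);
CREATE TABLE fact_cpu (
    server_id   INTEGER NOT NULL REFERENCES dim_server(server_id),
    day         TEXT NOT NULL,
    avg_cpu_pct REAL NOT NULL
);
-- Index chosen up front to suit the predefined report below.
CREATE INDEX idx_fact_cpu_day ON fact_cpu(day, server_id);
"""

# The predefined report: average CPU per business unit for one day.
AVG_CPU_BY_UNIT = """
SELECT d.business_unit, AVG(f.avg_cpu_pct)
FROM fact_cpu AS f
JOIN dim_server AS d ON d.server_id = f.server_id
WHERE f.day = ?
GROUP BY d.business_unit
ORDER BY d.business_unit
"""


def build_mart():
    """Create an in-memory mart and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO dim_server VALUES (?, ?, ?)",
        [(1, "web-01", "retail"), (2, "db-01", "retail"), (3, "app-01", "finance")],
    )
    conn.executemany(
        "INSERT INTO fact_cpu VALUES (?, ?, ?)",
        [(1, "2024-01-01", 40.0), (2, "2024-01-01", 60.0), (3, "2024-01-01", 30.0)],
    )
    conn.commit()
    return conn


def avg_cpu_by_unit(conn, day):
    """Run the predefined report for a single day."""
    return conn.execute(AVG_CPU_BY_UNIT, (day,)).fetchall()
```

The report is fast because the schema anticipates it. A question the schema did not anticipate (say, CPU by rack location) would require a new column or table, a reload, and probably a new index, which is where the database expert comes in.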
“Big Data” approaches
The “new kid on the block” approaches, increasingly referred to as big data, use a more generalized database/data storage structure that is primarily intended to support the integration and analysis of, and across, unstructured data sources (e.g., log files). One important consideration is how well this approach adapts to time-series performance metrics. In these use cases, the number of metrics being populated can reach hundreds of thousands per second in large environments. These data marts, rather than being pre-optimized for historical queries, are in effect “write-optimized” databases designed around the use case of aggregating widely disparate data types into a single centralized place for subsequent, downstream analysis and reporting.
The nature of real-time monitoring ensures that large quantities of data are produced over short periods of time. More importantly, a real-time monitoring solution cannot pause its monitoring activity in order to write its data into a traditional database, because measurement data would be lost in the meantime. So, while these data marts can be populated very rapidly, they are not optimized for analytic and historical query use cases. In effect, every query into these types of data marts is an ad hoc one; the more complex the analysis or query, the more its performance and latency will suffer.
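The write-optimized trade-off described above can be illustrated with a deliberately simplified sketch. The class name AppendOnlyMetricLog and its methods are hypothetical, and an in-memory Python list stands in for what would be a distributed store in practice. Writes are a cheap append with no indexing, while every read has to scan everything that was ever written.

```python
class AppendOnlyMetricLog:
    """Toy write-optimized store: cheap appends, full scan on every read."""

    def __init__(self):
        self._records = []  # (timestamp, metric_name, value) in arrival order

    def __len__(self):
        return len(self._records)

    def write(self, timestamp, metric, value):
        # No schema check, no index maintenance: the writer never waits.
        self._records.append((timestamp, metric, value))

    def average(self, metric, start, end):
        """Ad hoc query: mean of one metric over the window [start, end)."""
        total = 0.0
        count = 0
        # Nothing was pre-organized for this question, so scan every record.
        for timestamp, name, value in self._records:
            if name == metric and start <= timestamp < end:
                total += value
                count += 1
        return total / count if count else None
```

In this sketch the cost of average() grows with the total number of records stored, not with the size of the answer, which is the sense in which every query is an ad hoc one.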
Whether built with newer large-scale, real-time, or big data approaches on top of the Hadoop ecosystem, or with proprietary relational database or OLAP data mart approaches, the centralized database approach requires that all data essentially be duplicated from its original source(s). This requirement for data duplication or replication can be a serious obstacle for organizations that are considering integrating business or other proprietary metrics with IT metrics in a combined analytic solution. In the financial services and healthcare industries, internal and external audit and compliance regulations frequently forbid the duplication of much, or in some cases any, data. In these cases, the centralized approach, with its data duplication requirement, cannot be used to its full extent, or in some cases at all.
Data Federation Considerations
True data federation is becoming an increasingly attractive adjunct or alternative to centralized data mart approaches. Rather than creating and maintaining an additional centralized data store, federation in essence shares any and all existing data stores at read time. This approach builds on the reality that most existing management tools and processes already maintain their own stores of both performance-over-time information and contextual metadata.
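As a rough illustration of sharing existing stores at read time, the sketch below fans a single query out to several sources and streams the results back without copying anything into a central repository. Everything here is an assumption made for the example: the Sample and Source types, the federated_query and make_list_source functions, and the in-memory lists that stand in for each tool's own data store. A real federation layer would call each tool's SQL interface or API instead.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical shapes: a sample is (timestamp, metric_name, value), and a
# source is any callable that reads matching samples from a tool's own store.
Sample = Tuple[int, str, float]
Source = Callable[[str, int, int], Iterable[Sample]]


def make_list_source(records: List[Sample]) -> Source:
    """Wrap an in-memory list so it behaves like a tool's existing store."""

    def read(metric: str, start: int, end: int) -> Iterable[Sample]:
        # Lazily filter in place; the records are never copied elsewhere.
        return (r for r in records if r[1] == metric and start <= r[0] < end)

    return read


def federated_query(
    sources: Dict[str, Source], metric: str, start: int, end: int
) -> Iterator[Tuple[str, Sample]]:
    """Ask every registered source at read time and tag results by origin."""
    for name, read in sources.items():
        for sample in read(metric, start, end):
            yield name, sample
```

Because the data stays where it already lives, a source that is not permitted to be replicated can still take part in the combined analysis, subject to whatever access controls that source enforces on reads.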