Cost-Effective Big Data Retention Enables Better Analytics

Big data provides new opportunities to improve customer care, unearth business insights, control operational costs and, in some cases, enable entirely new business models. With access to larger and broader data sets, you can improve forecasts and projections for the business. A healthcare organization can conduct longitudinal analysis against years of data on patients treated for coronary attacks in order to improve care and speed time to recovery. A retailer can conduct deeper analysis of buying behavior during recessionary times if it has access to large data sets collected during the last economic downturn. Additionally, organizations across many sectors, such as communications, financial services and utilities, face significant regulatory and legal requirements for retaining and providing fast access to historical data for inquiries, audits and reporting.

Explosion in Data Volumes

The explosion in data volumes and the increasing diversity of new data types have added stress to traditional data warehouses and have raised the cost of storing all data on disk - that is, online and readily available for query. Information lifecycle management initiatives, while attractive as an architectural concept, are often difficult to implement, as organizations struggle to define the business rules for which data classes require disk versus lower-cost storage options. The tendency is to keep as much data for as long as possible, or at least until it becomes extremely painful, and then as a last resort to offload it to tape, often as an immediate cost-cutting measure. For many organizations, a more straightforward and practical step has been to keep recent data online and available for query - such as data up to 60, 90 or 180 days old - and then archive the older data in a separate, lower-cost repository that remains online and query-accessible.

Operational data stores and warehouses are well designed to store the recent data that enables insights about customer demand and business opportunities. For many companies, user populations are constantly in flux, growing and changing in various ways with evolving reporting requirements. Recent data that is just a few months old is generally the most sought after by operational business analysts, customer service and support teams, and business line managers, who need to report against the most recent transactional activity or to satisfy customer inquiries based on the latest transactions or statements.

Scalable Systems

Of course, historical data is necessary for compliance, accounting and long-term trend analysis. However, many businesses no longer fit the traditional model of an enterprise data warehouse that stores all data from the beginning of the organization's existence to the present day. Keeping older analytics data in an expensive data warehouse is not a good use of budget and resources - especially given current data growth rates across nearly every industry sector. Organizations are now grappling with how to keep that data online and accessible for the less frequent business query and the growing list of regulatory requests. The fact of the matter is that many data warehouses have grown to a size that is becoming unmanageable and, in many cases, cost-prohibitive. As new data sources are added with the goal of gaining a broader view across the enterprise, the problem of large, costly data warehouses only gets worse. Retaining big data requires scalable systems at cost levels appropriate for the data's overall usage frequency and, of course, its overall value to the enterprise. A traditional enterprise data warehouse that needs to provide high-performance analysis and query results for thousands of operational users often comes at a premium. A historical business intelligence repository, by contrast, stores data online for many years beyond the original transaction date and is accessed less frequently, by, say, a few business analysts and, more importantly, by compliance audit teams that demand immediate access to the data.

The time period that constitutes recent data, which often ranges between 60 and 180 days, varies by industry and data type. When the data is no longer required for business or regulatory purposes, it can be purged completely. However, new regulations in many industries have pushed out purge dates to seven years and beyond. Until the legal purge date is reached, it is important to keep the data accessible for queries and legal compliance while reducing the cost of long-term data retention.
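The age-based rules described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed policy: the function name retention_tier, the tier labels, the 90-day recent window and the seven-year retention period are all assumptions chosen from the ranges mentioned in this article, and a year is approximated as 365 days.

```python
from datetime import date

# Assumed values for illustration; real thresholds vary by industry and data type.
RECENT_WINDOW_DAYS = 90   # the article cites 60, 90 or 180 days
RETENTION_YEARS = 7       # the article cites "seven years and beyond"


def retention_tier(record_date: date, today: date,
                   recent_days: int = RECENT_WINDOW_DAYS,
                   retention_years: int = RETENTION_YEARS) -> str:
    """Return 'warehouse', 'archive' or 'purge' based on the record's age."""
    age_days = (today - record_date).days
    if age_days < 0:
        raise ValueError("record_date is in the future")
    if age_days <= recent_days:
        return "warehouse"   # recent data stays in the primary warehouse
    if age_days <= retention_years * 365:  # a year approximated as 365 days
        return "archive"     # older data moves to the lower-cost online repository
    return "purge"           # past the assumed legal purge date
```

In practice the thresholds would be set per data class, since both the recent window and the legal purge date differ by industry and data type.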

Let's take a closer look at a specific industry use case where data sets are large and rapidly growing, and where stringent industry regulations require online access for multiple years. A large communications service provider is faced with transaction volumes in the billions of records per day. This volume is expected to grow rapidly over the coming years due to the diversity of communications devices that have taken off in recent years. Just consider how quickly the iPhone has been adopted worldwide and the thousands of applications now available at consumers' fingertips. Added to this is the introduction of the tablet, with Apple's iPad taking the lion's share of the market. All these devices are generating billions of records every day. It's not just call details for phone usage that need to be stored, but also the avalanche of SMS/MMS content, mobile business transactions and web browsing, generally known as WAP content.


As this data comes off the network, it needs to be ingested and stored for online access and query. However, there are two very important considerations for IT when managing and storing this data long-term: first, the speed at which the data needs to be ingested, and second, the need for systems to scale cost-effectively to keep this data online for many months and years. Both point to the need for a technology that can accommodate these ever-growing requirements. Traditional relational databases simply cannot keep up with the large volume of daily load. By the same token, a traditional data warehouse can become cost-prohibitive at this volume of data and the scale required over time. This machine-generated data set needs a purpose-built repository that can ingest large volumes at speed, store them cost-effectively and scale over time to accommodate massive growth rates.

Last but not least, the system needs to meet the query and analytics performance requirements of both the business and regulatory compliance user communities. A number of other industry sectors also require a solution that not only stores large data sets but also provides cost-effective scale and ongoing access for query and analytics. They need a system that can provide access to a large data set, such as years of data, with fast query response, ultimately enabling better forecasts and projections. For organizations that have invested heavily in enterprise data warehouses, augmenting them with a dedicated solution for storing long-term historical data sets can provide compelling economic benefits. Implementing specific business rules for when to offload data from the central warehouse to the dedicated historical data repository will ultimately help improve overall performance on the primary warehouse, because the reduced data set is faster to query and less costly to maintain over time. Of course, if the central warehouse requires access to a broader historical data set, the archived data can be reinstated to supplement the central warehouse, which can then enable deeper analytics for those specific business use cases.
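As a rough sketch of such an offload rule and of reinstating archived data, the Python below splits records by a cutoff date and selects an archived date range for reloading. The function names split_for_offload and reinstate, the 90-day default and the (record_date, payload) tuple layout are hypothetical, chosen only for illustration; a real warehouse would apply the same rule to partitions or tables rather than in-memory lists.

```python
from datetime import date, timedelta


def split_for_offload(records, today, keep_days=90):
    """Split (record_date, payload) pairs into rows to keep and rows to offload.

    Rows dated on or after the cutoff stay in the central warehouse;
    older rows are destined for the historical repository.
    """
    cutoff = today - timedelta(days=keep_days)
    keep, offload = [], []
    for rec in records:
        if rec[0] >= cutoff:
            keep.append(rec)
        else:
            offload.append(rec)
    return keep, offload


def reinstate(archive, start, end):
    """Select archived rows dated within [start, end] for reloading into the warehouse."""
    return [rec for rec in archive if start <= rec[0] <= end]
```

Keeping the rule in one place makes it easier to adjust the cutoff per data class without touching the load or reload logic.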