Traditional data warehousing and big data environments are a powerful combination. Existing data warehouses store structured data and produce analytics and reports from it. Big data platforms enable analysis of large data volumes, including unstructured data. One way to bring it all together is a staging approach: employ a system such as Hadoop to capture the original data sources, and then forward files to the data warehouse. As a rule of thumb, raw, unstructured data goes into Hadoop or NoSQL database systems, while data that needs to be processed into repeatable insights is a candidate for the data warehouse.
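That rule of thumb can be sketched as a simple routing decision. The field names and landing-zone labels below are illustrative assumptions, not a prescribed implementation:

```python
# A minimal sketch of the staging rule of thumb: raw or unstructured
# sources land in Hadoop/NoSQL, while structured sources that feed
# repeatable reporting are routed to the data warehouse.

def route_source(source: dict) -> str:
    """Pick a landing zone for an incoming data source."""
    if source.get("structured") and source.get("repeatable_reporting"):
        return "data_warehouse"
    return "hadoop_staging"

sources = [
    {"name": "erp_orders", "structured": True, "repeatable_reporting": True},
    {"name": "clickstream_logs", "structured": False, "repeatable_reporting": False},
]

for s in sources:
    print(s["name"], "->", route_source(s))
```

In practice this decision is made by ingestion pipelines rather than a single function, but the routing criteria stay the same.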
Make the data warehouse a good citizen of the new data environment.
Organizations are looking into strategies such as data virtualization, in which information is presented to decision makers through a service layer independent of the underlying data stores. A key underlying data store may be a data warehouse, just as it may be a Hadoop-based database or something else. With a virtualized service layer, a data warehouse can be incorporated and its information rationalized with other data sources.
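The idea behind such a service layer can be illustrated with a small sketch: consumers ask for a dataset by logical name, and the layer resolves it to whichever backing store holds it. The store names, catalog structure, and fetch callables here are hypothetical:

```python
# Hypothetical sketch of a virtualized service layer: callers request a
# logical dataset name; the layer hides whether the data lives in the
# warehouse, Hadoop, or elsewhere.

class VirtualDataLayer:
    def __init__(self):
        self._catalog = {}  # logical name -> (store, fetch function)

    def register(self, logical_name, store, fetch):
        self._catalog[logical_name] = (store, fetch)

    def query(self, logical_name):
        store, fetch = self._catalog[logical_name]
        return {"source": store, "rows": fetch()}

vdl = VirtualDataLayer()
vdl.register("customer_segments", "warehouse", lambda: [("gold", 1200)])
vdl.register("raw_clicks", "hadoop", lambda: [("page_view", 98000)])

print(vdl.query("customer_segments")["source"])  # warehouse
```

The point of the design is that a consumer of "customer_segments" never needs to know which physical store answered the query, so stores can be swapped or federated behind the layer.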
Carry data warehouse best practices forward.
To cost-justify and build an enterprise data warehouse, data managers have typically undertaken a process of gaining business buy-in and building an ROI case. This often began with pilot or proof-of-concept projects, which were then extended to business units after successful results were delivered. These practices need to be continually applied to the data environment at large.
Introduce information lifecycle management (ILM) within the data warehouse.
With so much data surging through organizations, data managers need to make fresh, relevant data accessible in near-real time, while stale data is moved out of the way to less expensive locations. Unfortunately, data warehouses have not been good at ILM, write Bill Inmon and Krish Krishnan in Building the Unstructured Data Warehouse. “Data is entered, is integrated, is used for analysis, and then is archived. These different stages of the lifecycle of data require different treatment and technology,” they state. “To try to treat the data warehouse as if it had only one lifecycle while inside the data warehouse is a mistake. For example, when data is first put into a data warehouse, all data is pretty much accessed with the same probability. However, as data in the data warehouse ages, the probability of access changes dramatically. The newer data has a very high probability of access. The older data has an increasingly lower probability of access.”
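The aging behavior Inmon and Krishnan describe is what drives tiering policies: newer, frequently accessed data stays on fast storage, while older data is demoted to cheaper tiers. A minimal sketch, with age thresholds and tier names that are purely illustrative assumptions:

```python
# Illustrative ILM tiering by data age: high-probability-of-access data
# stays "hot"; as probability of access drops with age, data moves to
# cheaper storage. The 90-day and 365-day cutoffs are assumed values.

from datetime import date

def tier_for(record_date: date, today: date) -> str:
    age = (today - record_date).days
    if age <= 90:
        return "hot"      # fast, expensive storage
    if age <= 365:
        return "warm"     # mid-tier storage
    return "archive"      # cheap, slow storage

today = date(2014, 6, 1)
print(tier_for(date(2014, 5, 15), today))  # hot
print(tier_for(date(2012, 1, 1), today))   # archive
```

A real ILM policy would usually key on observed access frequency rather than age alone, but age is a common and cheap proxy.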
Accommodate all forms of data that come into the enterprise.
Enterprise data warehouses were originally designed to handle structured ERP or transactional files, not unstructured data types such as user-generated or machine-generated files. However, today’s generation of data warehouses and appliances, along with partnering applications, functions as a repository for unstructured data as well.
Develop a metadata strategy.
A metadata layer provides details on all the business-critical data that is available across the enterprise. “In the first generation of data warehouses, metadata was an afterthought, at best,” write Inmon and Krishnan in Building the Unstructured Data Warehouse. In the new, emerging generation of data warehouses, “metadata is an integral part of architecture.”
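At its simplest, such a metadata layer is a registry describing what business-critical datasets exist, where they live, and who owns them. The field names and sample entries below are illustrative assumptions:

```python
# A minimal sketch of a metadata catalog: each dataset is registered with
# its physical store, business owner, and description, so the metadata is
# part of the architecture rather than an afterthought.

catalog = {}

def register_dataset(name, store, owner, description):
    catalog[name] = {"store": store, "owner": owner, "description": description}

register_dataset("orders", "warehouse", "finance", "Daily order transactions")
register_dataset("web_logs", "hadoop", "marketing", "Raw clickstream events")

def find_by_store(store):
    """List the datasets held in a given physical store."""
    return sorted(n for n, meta in catalog.items() if meta["store"] == store)

print(find_by_store("warehouse"))  # ['orders']
```

Production metadata strategies add lineage, data types, and refresh schedules on top of this basic inventory, but the inventory is the foundation.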
Enterprise data warehouses have been successfully delivering a range of key analytics, from fraud detection to consumer segmentation, for more than two decades. In many cases, it would not pay to attempt to migrate or change these capabilities. The arrival of Hadoop and other new big data technologies means that data warehouse environments can be extended and enhanced with information that was previously too costly and complex for most enterprises to pursue.