As technology continues to advance, so does the volume of data that companies must store and process. Traditionally, companies have relied on data warehouses as their main data repository. But the data lake, typically built on Hadoop, offers a new approach to data storage.
Hadoop is ideally suited for a data lake because it’s inexpensive, open source, and scalable. What sets Hadoop apart from traditional database management systems (DBMSs) is its ability to process data without a common schema. As a result, many in the industry have begun to look to data lakes and Hadoop as the future of data storage. To help shed light on the data lake approach, a recent Unisphere webcast considered the pros and cons of this data repository. It was presented by Peter Evans, BI and analytics product evangelist and product technologist consultant, Dell Software; and Elliot King, Unisphere Research analyst.
Looking at the long-term trends in the data industry, King observed, “Whenever we come to a pivot point we seem to forget that this is important, that this is a turning point, but it is a turning point from somewhere. We have been on a very long road.” The trends to keep in mind when considering the data lake are the explosion in data, the shift to digital information, the drop in hardware cost, and the desire to analyze more data, he said.
Since the beginning of computing in the 1950s, data growth has consistently followed a J curve, and with the constant improvements in technology, there is no reason to believe this will slow down, King said. As technology has advanced, we have shifted toward digital media, which is better suited to data storage, and the technology that stores data has become relatively inexpensive. Arguably, the most consistent trend is the desire to analyze more data, said King, describing it as the “weather prediction syndrome.” “If we gather enough data eventually we will have all the answers,” said King.
The established method of aggregating data from multiple sources for business intelligence and analytics is the data warehouse. The issue, though, is that given today's data volumes, data warehouses are neither scalable enough nor agile enough to keep up.
This is where the data lake, with Hadoop, comes in. Both are flexible, scalable, and open source, and both have performed well with large amounts of data, especially data with differing schemas, King said. The downside is that the technology is still very new and remains complex, even for many industry experts. Its relative complexity raises questions about data quality and the elimination of data governance. King emphasized that while Hadoop data lakes are exciting, the industry is just beginning to understand them.
Evans cited the five laws of data integration: the whole is greater than the sum of the parts, there is no end state, there are no universal standards, information adapts to meet local needs, and all details are relevant. “Here at Dell, we are looking at the data lake as just another form of storage,” stated Evans, who went on to explain that the data lake will be part of a hybrid ecosystem, but not the only repository for data storage. Important factors when considering the Hadoop data lake are whether a company is going to use a Hadoop infrastructure and whether it has an analytical master plan, he said.
A replay of the webcast, “The Pros and Cons of a Hadoop Data Lake,” is available.