The Future of Data Lakes: Cloud, Object Stores, and Spark

In a keynote at Data Summit 2018, Paul Sonderegger, senior data strategist at Oracle, looked at how data technologies are evolving to help organizations extract more value from the wealth of data being collected.

Data has been called many things, such as the new oil and the new electricity, but, said Sonderegger, it is really the new capital, on a par with financial and human capital for creating new products and services. When we say that data is a kind of capital, it’s not a metaphor; it is literal, he said, explaining, “In economics, capital is an asset produced through some process and is then a necessary input to some other good or service. Data fulfills this definition.”

Much of the information that companies capture is for a specific purpose. Data lakes make it possible to repurpose this information for other uses.

However, while data lakes are growing in use, he said, Hadoop, which has been closely associated with data lakes, is shrinking.

According to research presented by Sonderegger, 20% of respondents to a 2016 survey had a data lake and 55% had a Hadoop installation. In a 2018 survey, however, 37% of respondents had a data lake but only 44% had a Hadoop installation, he noted, citing the difficulty of maintaining Hadoop deployments on premises.

The wave of the future for data lakes is cloud, object store databases, and Spark, he said.

Today, about 40% of Hadoop cluster deployments are in the cloud and about 60% are on premises.

The study also found that 26% of respondents had deployed Spark, the in-memory, parallelized, open source analytics framework, for their data lake, and of those, 31% had deployed Spark in the cloud, with 15% considering it.

Object stores are being embraced for data lakes as well, he said. Object stores represent decades-old technology but are highly effective for holding diverse data types. Object stores are deployed by 24% of the respondents to the 2018 survey, with 15% considering them, and of those who have deployed object stores for their data lakes, 21% deployed in the cloud.

Sonderegger concluded his presentation by noting that “Friedrich Hayek, one of the most influential economists of the 20th century, once wrote, ‘The economic problem of society is … the utilization of knowledge which is not given to anyone in its totality.’”

Hayek “was highlighting the supreme importance of getting knowledge to its highest-value points of use. With the digitization and datafication of everything, that knowledge is increasingly digital,” said Sonderegger.

