Data Lakes Keep Rising While Hadoop Sinks

Aug 8, 2018

By Joe McKendrick

This may seem contradictory at first glance: Fresh data from the database user community finds that data lakes continue to increase within the enterprise space as big data flows get even bigger. Yet, at the same time, enterprises appear to have pulled back on Hadoop implementations. In earlier days, Hadoop would have been seen as the supporting framework for data lakes. Now, there are many choices emerging for data managers to maintain their data lakes, especially from a plethora of cloud services. If anything, these two opposing trends point to a growing diversity of platforms and approaches now being used to move data as quickly and efficiently as possible from source to scorecard.

New developments in the data world—from cognitive computing to the Internet of Things—are making it critical to take lots of data feeds, pull the points that are of material importance, and engage with them in real time. There are a variety of tools, platforms, and frameworks now available to enterprises to better manage their data. In March 2018, Unisphere fielded a study among DBTA readers to explore the role of new technology initiatives in managing and making this data actionable for the business. This study, sponsored by Oracle, gathered the views and experiences of 203 IT decision makers, representing a broad sample of company types and sizes.

Next-generation data technology initiatives explored in the survey include data lakes, machine learning, Hadoop, Spark, object storage, and the platforms and environments that are supporting them. These distinct technologies are interacting with each other, converging, and paving the way to data-driven enterprises. The survey identified the following five key trends shaping the way enterprises leverage their data, as well as the evolving priorities of data managers.

Data lakes are continuing to grow.

Data lakes, which store diverse datasets without requiring a model to be built first, are perhaps the most mature technology initiatives seen among enterprises in the survey. Adoption of data lakes continues to rise as data managers seek to develop ways to rapidly capture and store data from a multitude of sources in various formats. Overall, 38% of organizations are employing data lakes as part of their data architecture, up from 20% in the 2016 survey. Another 15% are currently considering adoption (see Figure 1). Data lakes are growing to impressive proportions as well. Close to one-third, 32%, now support more than 100TB of data.

Hadoop is past its prime.

Hadoop, the big data framework that made massive-scale data analytics a reality for every company that needs it, is beginning to show its age. In the survey, 44% of enterprises reported having Hadoop in production, down from the 55% that reported using the framework in 2016 (see Figure 2).

Spark is moving toward the mainstream.

While it remains to be seen what role cloud computing will play in propelling Hadoop engagements, the cloud appears to be a natural setting for its newer sibling, Apache Spark. Spark, an open source analytics engine targeted at large-scale data processing at real-time speeds, is being used at one in four organizations as part of their data architectures, with another 15% considering adoption. A majority, 56%, either run Spark in the cloud (31%) or intend to do so over the coming year (25%). The most common use, cited by 59%, is real-time analytics/operationalized insights.

Machine learning is catching on.

Machine learning, a part of artificial intelligence, has been described as “getting computers to act without being explicitly programmed.” Data managers in the survey expressed quite a bit of enthusiasm for this higher level of automation. One in four organizations is employing machine learning as part of its data architecture. Another 20% of respondents said their organizations are considering adoption.

Object storage is gaining traction.

While Hadoop implementations have decreased, it appears some data managers are favoring object storage as an alternative to the Hadoop Distributed File System (HDFS). Object storage, which is designed to contain identifiers, metadata, and files within single units and be deployable across any and all devices and systems, is a relatively new phenomenon, but nonetheless already has a sizable base of adherents. One-quarter of data managers reported their organizations are employing object storage as part of their data architecture. Another 15% are considering adoption.