Big data is notable not because large volumes of data can now be stored and processed more cost-effectively, but because of what can be done with that data.
One approach that has been heralded for gaining more value from big data is the data lake, noted Jonathan Gray, CEO and founder of Cask, in a presentation titled “Building an Enterprise Data Lake” at Data Summit 2017, taking place at the New York Hilton Midtown, May 16-17, 2017.
The session was part of the Hadoop Day track, which was moderated by Unisphere analyst Joe McKendrick.
The problem is that big data involves projects, not products; expertise is lacking; and success largely hinges on one-off heroic efforts rather than repeatable blueprints, said Gray. In addition, data integration is challenging because it is complex and time-consuming, there is a risk of creating a data swamp, and the result is often a set of point solutions that create shadow IT.
Ideally, a data lake should give any user self-service access to any data, with the required security and governance built in, and make it easy for analysts, scientists and developers to access data.
However, the big data technology stack does not deliver a data lake, noted Gray. Instead, it provides services from which a data lake can be assembled.
In Gray’s view, there are three steps to building a data lake. First, create the “data pond” with raw data copied from existing internal data stores and outside data sources. Once use cases have been defined, the “data lake” can be created by bringing raw and defined data from other systems into a centralized cluster. Last, a “data reservoir” can be created using raw and defined data that is governed and audited to ensure compliance and security.
Each of these steps has its own challenges, but it is critical to start with a pond before moving on, said Gray. In the data pond, for example, processes are manual and it can be hard to find data. In the data lake, ensuring data is in its canonical format is critical, yet tasks are often very human-driven and there is not enough self-service. And, with the data reservoir, guaranteeing compliance can be difficult, and sharing infrastructure in a multitenant environment without quality-of-service support can also be hard.
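To make the progression concrete, the three stages can be sketched as a small state model in which each stage must be reached in order. This is an illustrative sketch only, not anything shown in the presentation: the class, field, and method names (DataPlatform, use_cases, governance_enabled, auditing_enabled, advance) are hypothetical, and the checks are a simplified reading of the prerequisites Gray described.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    POND = 1       # raw data copied from internal stores and outside sources
    LAKE = 2       # raw and defined data centralized once use cases exist
    RESERVOIR = 3  # raw and defined data that is governed and audited


@dataclass
class DataPlatform:
    # All fields are hypothetical stand-ins for real readiness checks.
    stage: Stage = Stage.POND
    use_cases: list = field(default_factory=list)
    governance_enabled: bool = False
    auditing_enabled: bool = False

    def next_stage_blockers(self):
        """Return the reasons the platform cannot advance yet."""
        blockers = []
        if self.stage == Stage.POND and not self.use_cases:
            blockers.append("no use cases defined")
        if self.stage == Stage.LAKE:
            if not self.governance_enabled:
                blockers.append("governance not enabled")
            if not self.auditing_enabled:
                blockers.append("auditing not enabled")
        return blockers

    def advance(self):
        """Move exactly one stage forward; stages cannot be skipped."""
        if self.stage == Stage.RESERVOIR:
            raise ValueError("already at the final stage")
        blockers = self.next_stage_blockers()
        if blockers:
            raise ValueError("cannot advance: " + ", ".join(blockers))
        self.stage = Stage(self.stage + 1)
        return self.stage


if __name__ == "__main__":
    platform = DataPlatform()
    platform.use_cases.append("example use case")  # placeholder name
    print(platform.advance().name)
    print(platform.next_stage_blockers())
```

Because advance() only ever moves one stage at a time, the sketch reflects the point that a pond comes before a lake, and a lake before a reservoir.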
To simplify the process, Cask offers the Cask Data Application Platform, the first unified, integrated platform for rapid time to value from big data, said Gray, who presented a live demonstration of the technology in use.
Many conference presentations have been made available by speakers at www.dbta.com/datasummit/2017/presentations.aspx.