Examining and Understanding Data Lakes at Data Summit 2017

May 26, 2017

By Stephanie Simone

Data lakes may not be the panacea everyone thought they would be, but if used properly, a data lake can be a rejuvenating force within an organization.

Vincent Yates, director of analytics engineering at Zillow Group, explained the key tenets of building a successful data lake at Data Summit 2017.

Computers, including machine learning and data science capabilities, have evolved so much that technology knows us better than our closest friends do, Yates said.

However, despite all that information, there is still a mystique about what it all means. Twenty-four percent of data scientists are unsure of what their data means, Yates explained.

Errors propagate in dynamic ways, and cleaning data to uncover the right insights is a time-consuming task.

According to Yates, 64% of data scientists say that poor data quality is the biggest hurdle within businesses, and it consumes 15% to 25% of enterprise profits.

“We keep shoving stuff into this old paradigm,” Yates said. “We have to be proactive, not reactive.”

Enterprises need to know about their data quality in real time and to be notified of an issue as soon as it arises.

By setting up alarms on data as it is loaded into a data lake, users can get alerts about the type of data going in and what is happening with it.
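As a rough illustration of an ingestion-time alarm, the minimal Python sketch below checks each batch before it is written and reports missing values and unexpected types. It is not the approach Yates presented: the function names (`check_batch`, `ingest`), the field names, the 5% threshold, and the in-memory list standing in for the data lake are all hypothetical.

```python
from typing import Any, Callable


def check_batch(
    records: list[dict[str, Any]],
    expected_types: dict[str, type],
    max_missing_rate: float = 0.05,  # hypothetical threshold
) -> list[str]:
    """Return one alert message per data quality problem found in the batch."""
    if not records:
        return ["empty batch"]

    alerts = []
    for name, expected in expected_types.items():
        values = [record.get(name) for record in records]
        missing = sum(1 for value in values if value is None)
        wrong_type = sum(
            1 for value in values
            if value is not None and not isinstance(value, expected)
        )
        missing_rate = missing / len(records)
        if missing_rate > max_missing_rate:
            alerts.append(f"{name}: {missing_rate:.0%} missing")
        if wrong_type:
            alerts.append(
                f"{name}: {wrong_type} value(s) not of type {expected.__name__}"
            )
    return alerts


def ingest(
    records: list[dict[str, Any]],
    expected_types: dict[str, type],
    sink: list[dict[str, Any]],
    notify: Callable[[str], None] = print,
) -> list[str]:
    """Raise alerts for the batch, then append it to the sink (the stand-in lake)."""
    alerts = check_batch(records, expected_types)
    for message in alerts:
        notify(message)
    sink.extend(records)
    return alerts


# Hypothetical batch: one missing price and one zip code stored as a number.
lake: list[dict[str, Any]] = []
batch = [
    {"price": 100, "zip_code": "98101"},
    {"price": None, "zip_code": 98101},
]
batch_alerts = ingest(batch, {"price": int, "zip_code": str}, lake)
```

In this sketch the data is still written even when alerts fire, so the alarm informs users without blocking the load. Whether to quarantine a failing batch instead is a design choice the talk summary does not address.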

Businesses have been obsessed with the wrong question. Instead of asking what data we need to make models better, we should be asking what models we need to make data better.

“Ask not what your data can do for your models, ask what your models can do for your data,” Yates said.

Many conference presentations have been made available by speakers at www.dbta.com/datasummit/2017/presentations.aspx.