Why Traditional Data Preparation Approaches Fail

Video produced by Steve Nathans-Kelly

At Data Summit Connect 2020, Thomas Cook, director of sales, Cambridge Semantics, described the laborious process of manually discovering, sorting, cleaning, and conforming siloed data that consumes the lion's share of data scientists' time, and how new approaches are improving the process.

Full videos of Data Summit Connect 2020 presentations are available at www.dbta.com/DBTA-Downloads/WhitePapers.

The dirty secret of data science is that 70% to 80% of a data scientist's time is spent on data preparation and feature engineering, Cook explained.

When he raises this figure with data scientists, they typically chuckle and say, "It's more like 90 to 95%," Cook said.

"The aspect of discovering the data, finding the data that's suitable, cleaning, conforming, and creating features is also the least enjoyable part of their job," Cook said. 

So how can organizations make this easier and make the job more enjoyable, reducing the cost of developing the models and applications, and reducing burnout?

One of the traditional approaches to connecting all these disparate silos of information is the data warehouse. Data warehouses have been around for a long time and were the original way of trying to put all of the data in a single place to get a single, canonical view of information, Cook noted.

"But the SQL data model is very rigid. It resists change. Anytime there is a change, it's very costly to implement a change to the schema or a change to a report. We have to go through a lot of testing, and they're very expensive. Are they going away? No, but are they evolving? Yes.

"Next, we saw the use of data lakes and the promise of the pristine data lake, where you drop all of your operational raw data in there and just let people go and analyze it. This promise was never fulfilled, and the pristine data lakes were quickly termed data swamps," Cook said.

Though Spark is very widely used and adopted today, the data engineering efforts are very costly and complex, Cook explained.

"They lack lineage and are oftentimes not repeatable. We had a large telco customer who said, 'We can give two data scientists the same problem, and they'll come back with two different answers.' The reason is the complexity of where the data comes from, the rules to combine it, and the data quality rules they need to apply to get those answers. They're always different," Cook said.
