Why is Data Integration So Hard Now? (VIDEO)

Video produced by Steve Nathans-Kelly

As data flows into organizations from a previously unimaginable array of sources and at greater speed and volume than ever before, the challenges of cleaning, deduplicating, and integrating data are also increasing.

At Data Summit 2019, Pythian’s Danil Zburivsky considered the question of why data integration is so hard today in a presentation titled, “Dismantling Data Silos Through Cloud Integration.”

DBTA’s next Data Summit conference will be held May 19-20, 2020, in Boston, with pre-conference workshops on Monday, May 18.

“I think we can actually talk about several things, starting with the data volume itself. That's something that has significantly increased in the last several years.

“At the keynote today, I think Michael [Stonebraker] said this is a solved problem: You just buy a Teradata or a Netezza, and they do this massively parallel processing, like this is a done deal. But guess which service is most in demand today? Migrations from Teradata, migrations from Netezza. So you can solve this problem, you can buy a little expensive appliance. But the thing is, the cost (the support costs, the maintenance cost of the systems) is too much for many organizations. People just can't afford it anymore. And there are more accessible offerings today that you can look into.”

Echoing Stonebraker’s keynote at Data Summit, Zburivsky said variety is truly the 800-pound gorilla in the room. “I think variety is really one of the key driving factors behind the challenges in ETL. ETL tools just have to be able to deal with different file formats, different sources, and they really have a hard time keeping up.”

Lastly, he noted, an often-overlooked element in data integration difficulty is the veracity of data. “When I bring in all the data sources, which ones are actually ones that I can trust? What actually is high-quality data, what's the data I can build my financial reports on versus what's the temporary sandbox area for my data scientists to play with, right? So this is another big, big challenge with ETL that we see.”

Many presenters have made their slide decks available on the Data Summit 2019 website at

To access the full video of Danil Zburivsky’s Data Summit 2019 presentation, “Dismantling Data Silos Through Cloud Integration,” go to: