Why Is Data Integration So Hard? New Answers to Old Problems

Data is flowing into organizations from a previously unimaginable array of sources and at unprecedented speed and volume. This means that the challenges of cleaning, deduplicating, and integrating data are increasing.

At Data Summit 2019, Danil Zburivsky, director of engineering for Pythian's Kick AaaS product, presented a session titled “Dismantling Data Silos Through Cloud Integration,” explaining why a cloud-native data platform may be the best way for organizations to cost-effectively deliver on the promise of better insights and more intelligent systems through data. At Pythian, Zburivsky helps customers architect and build large, mission-critical data platforms using MySQL, Hadoop, and MongoDB.

Zburivsky covered how a cloud integration approach can improve data governance, produce more accurate analysis, and ensure consistency of data across systems. He also covered best practices for cloud data integration and how a cloud data platform breaks down data silos within the organization. The presentation also looked at how a luxury retail client successfully moved its global sales data to the cloud to uncover new opportunities.

Currently, data management faces multiple challenges:

  • Volume: too much data for organizations to manage easily
  • Variety: many types of data and sources
  • Velocity: data is flowing into organizations quickly, and requirements for speed in analytics are growing
  • Veracity: organizations need to know which data they can trust for financial reports and similar uses

In addition:

  • Today’s ETL tools and SQL offer limited functionality. Traditional systems were designed for batch processing and warehouse design is batch-oriented, which makes real-time analytics hard to implement.
  • The trend to bring your own data raises data quality issues.
  • Hadoop has not played out as the unified data platform—it turned out to be too complex to use and operate.

A modular data platform requires modular design, said Zburivsky, and it requires the cloud to make it work.

The public cloud offers services that can be stitched together, is elastic on multiple levels, and enables platform as a service. These features combine to make the public cloud very useful as a data platform, he said, explaining that a distributed framework is critical for processing.

Many presenters are making their slide decks available on the Data Summit 2019 website at