Why Is Enterprise Data Integration So Challenging? (VIDEO)

Jun 18, 2019

By Joyce Wells

A.M. Turing Award Laureate and database technology pioneer Michael Stonebraker delivered a welcome keynote at Data Summit 2019, titled “Big Data, Technological Disruption, and the 800-Pound Gorilla in the Corner.”

In his presentation, Stonebraker—who is an MIT Adjunct Professor and Tamr co-founder—offered his views on many of the thorny big data challenges facing enterprises today, the established and newer technologies available to address these issues, and the intractable problem that remains the 800-pound gorilla in the room.

_______________________________________________
DBTA’s next Data Summit conference will be held May 19-20, 2020, in Boston, with pre-conference workshops on Monday, May 18.
_______________________________________________

A persistent problem of data integration and data cleaning exists in the enterprise, Stonebraker explained in his presentation. "General Electric—believe it or not—has 75 procurement systems," he said.

"What's a procurement system? Well if you want to buy a paper clip, you go to your procurement system, and it asks you for what charging number you have, it spits out a purchase order, and you take the purchase order down to Staples and you get your paper clips. That's what a procurement system does. The obvious correct answer to how many procurement systems an enterprise should have is one. GE has 75."

Why is that? asked Stonebraker. "Well, it's because they buy companies that have a procurement system, and they sell companies. It's like bingo cards: You buy and sell things, and most things come with a lot of software, and unless you're willing to stop, spend the time to integrate all of this software into your current systems, you end up running multiple things. So they are running 75 of them."

According to Stonebraker, the GE CFO has estimated that the company could save $100 million a year if it could determine the terms and conditions negotiated by every entity within GE. Knowing what everybody else had negotiated, a buyer could demand most-favored-nation status from a supplier. "All you have to do is integrate 75 independently constructed supplier databases. And they have, like, 9 million supplier records, and you have to figure out which ones are duplicates so that you can figure out which ones are actually Staples: $100 million a year, a data integration and data cleaning problem," said Stonebraker.

"Enterprises also want to do data integration on parts, customers, lab data, lots of other things," he said. There is, he said, a huge amount of money in the enterprise that could be saved, and it is mainly a data cleaning and data integration problem. "Why is data integration hard? Well, for every local data source that you want to ingest, whether you're the iRobot data scientist or the GE person inside the enterprise, you have got to find the data source, you have got to ingest it, meaning convert it from whatever representation it has now into some common representation. You have got to perform transformations—so if I'm the human resources guy in New York, I deal with salaries in dollars, the guy in Paris deals in euros, I've got to perform data cleaning."
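The ingest-and-transform steps Stonebraker describes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names ("name", "salary", "nom", "salaire_eur"), the Salary record, and the fixed EUR_TO_USD rate are hypothetical and do not come from the keynote.

```python
from dataclasses import dataclass

# Hypothetical fixed exchange rate, for illustration only.
# A real pipeline would look the rate up by date.
EUR_TO_USD = 1.10


@dataclass
class Salary:
    """Common representation: every salary is stored in U.S. dollars."""
    employee: str
    amount_usd: float


def ingest_new_york(row: dict) -> Salary:
    # The New York source already records salaries in dollars.
    return Salary(employee=row["name"], amount_usd=float(row["salary"]))


def ingest_paris(row: dict) -> Salary:
    # The Paris source uses different field names and records euros,
    # so ingestion both renames the fields and converts the currency.
    usd = round(float(row["salaire_eur"]) * EUR_TO_USD, 2)
    return Salary(employee=row["nom"], amount_usd=usd)


new_york_rows = [{"name": "Ann", "salary": "90000"}]
paris_rows = [{"nom": "Luc", "salaire_eur": "50000"}]

unified = [ingest_new_york(r) for r in new_york_rows]
unified += [ingest_paris(r) for r in paris_rows]
```

Each source needs its own small adapter like these, which hints at why the work grows quickly as the number of independently built systems increases.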

According to Stonebraker, a good rule of thumb is that 10% of data is wrong or missing, and fixing it is not simple. "You have got to do schema integration. My wages or your salary, all that stuff. You have got to perform deduplication—which is GE's problem—find all the records that actually correspond to Staples. And then, oftentimes, once you find clusters of records that correspond to the same entity, you want to find what are called golden records, golden values, which is, you want to figure out which address of Staples you want to use. If I have two different ages, one of them has got to be wrong." The difficulty, he said, is that all of these processes must be handled correctly and they must be done at scale.
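As a rough illustration of deduplication and golden values, the Python sketch below clusters supplier records by a normalized name and then picks the most frequent address in each cluster. It is a simplified assumption, not the method Stonebraker or any product uses: the sample records, the SUFFIXES list, and the helper names match_key, cluster, and golden_value are all hypothetical, and matching on an exact normalized key would miss many real-world duplicates.

```python
import re
from collections import Counter, defaultdict

# Hypothetical list of company suffixes to ignore when matching names.
SUFFIXES = {"inc", "corp", "co", "llc", "ltd"}


def match_key(name: str) -> str:
    """Normalize a supplier name: lowercase, drop punctuation and suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def cluster(records: list) -> dict:
    """Group records whose normalized names are identical."""
    clusters = defaultdict(list)
    for record in records:
        clusters[match_key(record["name"])].append(record)
    return dict(clusters)


def golden_value(cluster_records: list, field: str):
    """Pick the most common non-empty value of a field within a cluster."""
    values = [r[field] for r in cluster_records if r.get(field)]
    return Counter(values).most_common(1)[0][0] if values else None


# Made-up sample records standing in for rows from separate systems.
suppliers = [
    {"name": "Staples, Inc.", "address": "500 Staples Dr"},
    {"name": "STAPLES", "address": "500 Staples Drive"},
    {"name": "Staples Inc", "address": "500 Staples Dr"},
    {"name": "Acme Corp.", "address": "1 Main St"},
]

clusters = cluster(suppliers)
staples_address = golden_value(clusters["staples"], "address")
```

Majority vote is only one possible rule for choosing a golden value. At the scale Stonebraker mentions (millions of records across dozens of systems), exact-key grouping like this would need to give way to approximate matching and human review of uncertain cases.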

Many presenters have made their slide decks available on the Data Summit website at www.dbta.com/DataSummit/2019/Presentations.aspx.

To access Stonebraker's full Data Summit 2019 keynote, titled "Big Data, Technological Disruption, and the 800-Pound Gorilla in the Corner," go to https://datasummit.brightcovegallery.com/detail/videos/data-summit-2019-keynotes/video/6040469251001/keynote---big-data-technological-disruption-and-the-800-pound-gorilla-in-the-corner?autoStart=true#links