Data Discovery Is the Next Evolutionary Step in Data Integration

"In the struggle for survival, the fittest win out at the expense of their rivals because they succeed in adapting themselves best to their environment." - Charles Darwin, The Origin of Species

Data integration (DI) technology, specifically extract, transform, and load (ETL) middleware, has played a key role in advancing business intelligence (BI) and performance management since the mid-1990s when combined with an intermediate data store such as a warehouse or mart. Virtualized DI evolved from these technologies in the mid-2000s. Also known as virtual data federation or enterprise information integration (EII), virtual DI eliminates the intermediate data store by using high-performance query techniques that let the consuming application pull data directly from the source in real time.

The next evolutionary DI step is currently in a nascent stage. Data discovery revolutionizes how business professionals can leverage enterprises’ ever-expanding data assets, thus changing the competitive dynamic with its speed and simplicity.

Drivers of the DI Evolution: Data Volume and Source Complexity

IDC recently estimated that enterprise data is growing at a compound rate of nearly 60 percent annually. At that rate, enterprises will likely have 10 times today’s data by 2013 and 100 times by 2018.
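The 10-times and 100-times figures follow from compounding. As a quick check, the sketch below assumes that "today" is roughly five years before 2013; the base year is an assumption, since the text does not state it, and the function name is illustrative.

```python
# Compound growth check for the "nearly 60 percent annually" estimate.
ANNUAL_GROWTH = 0.60  # rate taken from the text; treated as exactly 60% here


def growth_multiple(years: int, rate: float = ANNUAL_GROWTH) -> float:
    """Return how many times larger the data volume is after `years` years."""
    return (1.0 + rate) ** years


# Assumed horizons: 5 years (to 2013) and 10 years (to 2018).
for years in (5, 10):
    print(f"after {years} years: {growth_multiple(years):.1f}x")
# prints:
# after 5 years: 10.5x
# after 10 years: 110.0x
```

At exactly 60 percent the multiples come out slightly above 10 and 100, which is consistent with the rounded figures quoted.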

Concurrent with this growth has been the rapid expansion of data complexity. Data can be structured in rows and columns within transaction systems. Data can be unstructured in documents stored on desktops. Recent advancements with new XML standards have opened the door to semi-structured data, which is often available through web services in a service-oriented architecture (SOA).

Today’s enterprises typically have hundreds, if not thousands, of unique, structured data sources built, bought and/or acquired via merger. Each has its own syntax, access methods, metadata and more, presenting myriad challenges to the access and use of these information assets. New applications, such as management portals, e-commerce solutions and performance analytics that require data from diverse sources, add more complexity. These applications need data in a specific format that is typically not compatible with how the data is stored in its original sources.

Beyond volume and complexity, time to solution is another significant factor. Business change equals IT change. The proverbial endless backlog greatly impacts enterprises’ abilities to successfully adapt to market changes. Accelerating new projects through better tools and/or fewer steps has become more important than ever.

Helping Business Professionals Discover Their Data

Regardless of the DI approach taken, IT professionals are currently the primary go-to data source in today’s enterprises. Business users such as business analysts, engineers, scientists, production planners and customer service managers are the primary data consumers. This dependency of business on IT often creates frustration between the groups. Facing numerous requests, IT deals daily with backlogs and delays. To operate efficiently and meet this high volume of demand, IT requires that data consumers request the exact information they need. Yet, this isn’t as easy as it sounds.

Daily, business professionals face new and potentially unforeseen problems. They also need to resolve unanticipated issues or answer new questions as they arise. This variability makes it difficult to anticipate what information will be required prior to making informed decisions, answering questions or resolving issues.

Data Discovery - Regaining Competitive Advantage

Data discovery applications are end-to-end solutions that let business professionals “do it themselves” with minimal IT assistance. Complementary to existing reporting and analytic solutions, data discovery currently opens the door to structured and semi-structured data across the enterprise.

Specifically, data discovery allows business users to find the data they need using a keyword search paradigm, relate that discovered data to other data in the enterprise to get a complete picture, and then share the results with colleagues using applications such as Microsoft Excel. In short, the business user can go from raw data to having his or her question answered in a few minutes, with minimal to no IT involvement.
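The find, relate and share steps described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular data discovery product works: the source names, columns and rows are hypothetical, the "index" is a simple in-memory keyword map, and "sharing" is reduced to producing CSV text that a spreadsheet such as Microsoft Excel can open.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample sources; names and rows are illustrative only.
SOURCES = {
    "crm.customers": [
        {"customer_id": 1, "name": "Acme Corp", "region": "West"},
        {"customer_id": 2, "name": "Globex", "region": "East"},
    ],
    "erp.orders": [
        {"order_id": 10, "customer_id": 1, "status": "late"},
        {"order_id": 11, "customer_id": 2, "status": "shipped"},
    ],
}


def build_index(sources):
    """Map each lowercase token to the (source, row position) pairs containing it."""
    index = defaultdict(set)
    for source, rows in sources.items():
        for pos, row in enumerate(rows):
            for value in row.values():
                for token in str(value).lower().split():
                    index[token].add((source, pos))
    return index


def search(index, sources, keyword):
    """Find: return (source, row) hits for a single keyword."""
    hits = sorted(index.get(keyword.lower(), set()))
    return [(source, sources[source][pos]) for source, pos in hits]


def relate(sources, row, other_source, key):
    """Relate: return rows in another source that share the same key value."""
    return [r for r in sources[other_source] if r.get(key) == row.get(key)]


def to_csv(rows):
    """Share: render rows as CSV text that a spreadsheet can open."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


index = build_index(SOURCES)
source, customer = search(index, SOURCES, "acme")[0]
orders = relate(SOURCES, customer, "erp.orders", "customer_id")
print(to_csv(orders))
```

A real deployment would search indexed enterprise sources rather than in-memory dictionaries, but the flow is the same: one keyword leads to a customer record, the shared key leads to that customer's orders, and the result leaves the tool in a form colleagues can open.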

IT’s Role in a Data Discovery Environment

Because business professionals require less IT intervention in the data discovery process, what is the evolved role of IT? IT provides critical, behind-the-scenes expertise in typical data discovery deployments. At set-up, IT installs the tool, giving account credentials and privileges to users. Next, IT grants access to source data domains. Finally, IT runs the data indexer as well as the relationship discovery tool. These set-up activities typically represent a few days’ work.
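Vendors implement relationship discovery in their own ways, and the text does not describe a specific method. One simple heuristic, shown here purely as an assumption, is to flag columns that share a name across two sources and whose values overlap substantially. The sample tables, the function names and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations

# Hypothetical sample tables; names and values are illustrative only.
TABLES = {
    "crm.customers": [
        {"customer_id": 1, "name": "Acme Corp"},
        {"customer_id": 2, "name": "Globex"},
    ],
    "erp.orders": [
        {"order_id": 10, "customer_id": 1},
        {"order_id": 11, "customer_id": 2},
        {"order_id": 12, "customer_id": 2},
    ],
}


def column_values(rows, column):
    """Collect the distinct values of one column."""
    return {row[column] for row in rows if column in row}


def discover_relationships(tables, min_overlap=0.5):
    """Return (table_a, table_b, column, overlap) for same-named columns
    whose distinct values overlap by at least `min_overlap`."""
    found = []
    for name_a, name_b in combinations(sorted(tables), 2):
        rows_a, rows_b = tables[name_a], tables[name_b]
        cols_a = {col for row in rows_a for col in row}
        cols_b = {col for row in rows_b for col in row}
        for column in sorted(cols_a & cols_b):
            vals_a = column_values(rows_a, column)
            vals_b = column_values(rows_b, column)
            smaller = min(len(vals_a), len(vals_b))
            if smaller == 0:
                continue
            # Overlap is measured against the smaller set of distinct values.
            overlap = len(vals_a & vals_b) / smaller
            if overlap >= min_overlap:
                found.append((name_a, name_b, column, overlap))
    return found


relationships = discover_relationships(TABLES)
print(relationships)
```

Running a pass like this once at set-up, and again on a schedule, is what lets a business user later move from one source to a related one without knowing the join keys in advance.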

During runtime, IT periodically updates data indexing and relationships to ensure searchable data remains fresh. Further, IT administers users and adds new data sources to correspond with on-going organizational and system changes. In addition, IT can make data discovery easier and more productive by adding annotations, aliases, domains, synonyms and views.

Early-adopter scenarios show that the incremental support effort required of IT is relatively minor. An added benefit: data discovery products keep data security risks low by leveraging and conforming to existing security paradigms and controls, down to the row and column levels. Because data discovery tools are non-invasive, they add little burden to existing architectures and operations. Finally, IT’s overall workload is likely to be reduced through the elimination of a large percentage of new reports and other requests from the more self-sufficient business professionals.

When to Implement Data Discovery

Determining when to implement data discovery typically begins with a survey of a representative sample of business users (10 or more). We have found the following questions to be useful during this process.

   1. What kinds of questions, problems and decisions are encountered regularly, and what kind of impact do these currently have on revenues, costs and business risks?
   2. How many hours per day are spent answering these questions/resolving problems/making decisions? Is this number increasing or decreasing?
   3. What percentage of time is the necessary information readily available? Is this number increasing or decreasing?
   4. If the time spent on this problem-solving/decision-making process could be reduced by 10 percent, how much would this be worth to the enterprise? What about 20 percent?
   5. If this problem-solving/decision-making could be performed 10 percent more frequently, how much would this be worth to the enterprise? What about 20 percent?
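The survey answers to questions 2 and 4 can be turned into a rough annual figure. The sketch below is a back-of-the-envelope estimate only; the user count, hours per day, hourly cost and 250 working days are hypothetical inputs chosen for illustration, not data from any survey.

```python
def annual_value_of_time_saved(users, hours_per_day, hourly_cost,
                               reduction, workdays_per_year=250):
    """Estimate the yearly value of cutting time spent on problem-solving.

    users             -- number of business users affected (hypothetical input)
    hours_per_day     -- hours each user spends per day (survey question 2)
    hourly_cost       -- loaded cost of one user-hour (hypothetical input)
    reduction         -- fraction of time saved, e.g. 0.10 for 10 percent
    workdays_per_year -- assumed number of working days
    """
    hours_saved_per_day = users * hours_per_day * reduction
    return hours_saved_per_day * hourly_cost * workdays_per_year


# Hypothetical example: 100 users, 2 hours a day, 50 per hour.
for reduction in (0.10, 0.20):
    value = annual_value_of_time_saved(100, 2.0, 50.0, reduction)
    print(f"{reduction:.0%} reduction: {value:,.0f} per year")
```

An estimate like this captures only time saved; the impact on revenues, costs and business risks raised in question 1 would need to be assessed separately.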


As in Darwin’s natural world, where the fittest species adapted to survive changing environments, commercial enterprises, government agencies and the technology that serves them are adapting to harness the explosive growth of information volumes and data source complexity. One of the technologies deployed by competitive enterprises, data integration, has recently delivered an evolutionary step in the form of data discovery, which enables business decision-makers to find the information they need to make informed decisions, answer questions and solve problems with reduced dependency on IT.