The Rise of the Data Catalog

Dec 1, 2019

By Craig S. Mullins

If you have been around the IT industry for as long as I have, you have seen technologies and ideas come and go—and sometimes even come back again. This is surely the case with the “new” products that call themselves data catalogs.

But before talking about the rise of the data catalog, I will take a look at its predecessors, the data dictionary and the Repository, which I will refer to with a capital R to differentiate the metadata Repository from the more generic data repository.

All of these products are designed to capture, store, and manage metadata. For any piece of data to be understood, metadata is required. Metadata characterizes data. It provides documentation about the data so that it can be understood and more readily consumed by an organization. Metadata answers the who, what, when, where, why, and how questions for users of the data.

Let’s start with the data dictionary. Although the term “data dictionary” is no longer in vogue, it refers to a centralized location for storing and defining metadata. Data dictionaries first rose to popularity in the 1980s. You can think of the system catalog for an RDBMS as a data dictionary for technical metadata (and, indeed, Oracle refers to its system catalog as the Data Dictionary). A data dictionary typically enabled the collection, storage, and management of information about data elements, data types and lengths, relationships to other data, and so on.
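To make the system catalog comparison concrete, here is a minimal sketch in Python that reads technical metadata straight from a database's own catalog. It uses SQLite only because it ships with Python's standard library; the customer table, its columns, and the describe_table helper are hypothetical names chosen for illustration, and other database systems expose their catalogs through different views.

```python
import sqlite3

def describe_table(conn, table_name):
    """Return technical metadata for one table from SQLite's catalog.

    table_name is interpolated into the statement, so it must come
    from a trusted source (here it is a hard-coded, hypothetical name).
    """
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return [
        {
            "column": name,
            "type": col_type,
            "nullable": not notnull,
            "primary_key": bool(pk),
        }
        for _cid, name, col_type, notnull, _default, pk in rows
    ]

# Hypothetical table used only to have something to describe.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "cust_id INTEGER PRIMARY KEY, "
    "name VARCHAR(40) NOT NULL, "
    "region CHAR(2))"
)

for entry in describe_table(conn, "customer"):
    print(entry)
```

Each printed entry carries the name, declared type and length, nullability, and key status of one column, which is the kind of detail a data dictionary was built to hold.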

Repositories started gaining popularity in the 1990s with offerings that included IBM’s Repository Manager/MVS and the Platinum Repository. The metadata Repository extended the data dictionary concept to include more information about the data than the basic type and length details. This included business metadata, such as useful definitions and descriptions; operational metadata, including provenance and lineage details on where the data originated and how it was captured; and other usage details, such as where the data was used in application programs and systems.

Obviously, knowing what data you have, where it is used, and why it is important is useful to most organizations. But implementing a Repository typically requires a large project to discover the metadata and then keep the Repository accurate. Such projects are time-consuming, and as things change, the Repository will become outdated unless proper procedures are instituted to update it as changes and new projects roll out. Success requires centralized management and significant planning and coordination from the beginning.

Enter the Data Catalog

And that brings us to the data catalog, which extends the concept of metadata capture and management further through automation and modern discovery techniques. Gartner has defined a data catalog as a tool that “creates and maintains an inventory of data assets through the discovery, description and organization of distributed datasets.”

The ability to keep the information in a data catalog up-to-date using automated discovery and search techniques solves the greatest failure point of Repository products: keeping the data fresh and useful without requiring tedious manual processes. In many cases, AI and machine learning capabilities are being built into data catalogs for scanning and automatically discovering metadata and meaningful relationships among the data elements. By using machine learning to simplify and automate the discovery and maintenance of metadata, a searchable data catalog can act as a recommendation engine for data.
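The automated refresh described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular catalog product works: it rescans a SQLite database's system catalog and reports what changed since the last scan, and it makes no attempt at the machine learning side of discovery. The scan_schema and refresh_catalog helpers and the orders and customer tables are hypothetical.

```python
import sqlite3

def scan_schema(conn):
    """Read current table and column definitions from SQLite's catalog."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk).
        # Table names come from the catalog itself, so they are trusted here.
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = {col[1]: col[2] for col in cols}
    return schema

def refresh_catalog(catalog, conn):
    """Update the catalog dict in place and report what changed."""
    current = scan_schema(conn)
    added = sorted(set(current) - set(catalog))
    removed = sorted(set(catalog) - set(current))
    changed = sorted(
        t for t in set(current) & set(catalog) if current[t] != catalog[t]
    )
    catalog.clear()
    catalog.update(current)
    return {"added": added, "removed": removed, "changed": changed}

# Hypothetical source database that drifts between two scans.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

catalog = {}
first = refresh_catalog(catalog, conn)

conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT")
conn.execute("CREATE TABLE customer (cust_id INTEGER)")
second = refresh_catalog(catalog, conn)

print(first)
print(second)
```

Run on a schedule, a scan like this keeps the inventory current without anyone re-keying definitions by hand, which is the manual step that left Repositories outdated.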

It has long been the goal of data professionals to create and maintain an inventory of all the data assets of an organization, but the goal has proven to be difficult and costly. Data silos, immature tools, rapid data growth, and multiple copies of data have contributed to this failure. But now, with the data catalog and its improved capabilities for automation, discovery, and classification, it may become possible to create and maintain an inventory of your organization’s data.

Using a data catalog can deliver context to your organization’s data so that data users—such as data scientists, developers, data analysts, and other business data consumers—can find the data they need and understand the meaning of the data they are using. A data catalog can capture and manage all types of metadata, whether technical, operational, or business in nature. Most data catalogs also provide collaboration capabilities that enable the capture of additional user-provided or social metadata. Furthermore, a data catalog can help to determine data lineage and perform impact and usage analysis.
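Impact analysis over lineage metadata amounts to a graph traversal. The sketch below, in Python, assumes lineage is stored as a simple mapping from each data asset to the assets it feeds; the asset names and the downstream_impact helper are hypothetical, and real catalogs store lineage in their own formats.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed in its value.
LINEAGE = {
    "crm.customer": ["staging.customer_clean"],
    "staging.customer_clean": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["report.churn_dashboard", "model.churn_score"],
    "erp.orders": ["warehouse.fact_orders"],
    "warehouse.fact_orders": ["report.churn_dashboard"],
}

def downstream_impact(lineage, asset):
    """Return every asset reachable downstream of `asset`, sorted by name."""
    seen = set()
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            # Skip assets already visited so shared targets and cycles
            # do not cause repeated work or an endless loop.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(downstream_impact(LINEAGE, "crm.customer"))
```

Asking what depends on crm.customer returns the staging table, the warehouse dimension, the dashboard, and the scoring model, so a planned change to the source can be checked against everything it would touch.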

The era of the data catalog is upon us and most organizations should be looking to implement a data catalog soon, if they have not already.