Managing Data That is Constantly on the Move

Data is not sedentary. Once data has been created, it gets moved around to support many different purposes. One basic distinction is between production data and test data, and it is likely that there are multiple test environments, for example, to support unit testing, integration testing, new product deliveries, training, and so on. Moreover, there may be multiple copies of the “same” data in each of those environments to support different applications, different geographies, different users, different computing environments, and different DBMSs.

Rarely is a single copy of any piece of data good enough. Data is copied and transformed and cleansed and duplicated and stored many times throughout most organizations. Different copies of the same data are used to support transaction processing and analysis; test, quality assurance, and operational systems; day-to-day operations and reporting; data warehouses, data marts, and data mining; and distributed databases. And who manages all of this moving data? Typically, it is the DBA group.

There are many techniques that can be used to facilitate data movement. One of the simplest ways for the DBA to move data from one place to another is to use the LOAD and UNLOAD utilities that come with the DBMS. The LOAD utility is used to populate tables with new data (or to add to existing data), and the UNLOAD utility is used to read data from a table and put it into a data file. Each DBMS may call the actual utilities by different names, but the functionality is the same or similar from product to product. For example, Oracle offers SQL*Loader and Microsoft SQL Server provides the BCP utility.
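The basic flow of unload and load can be sketched in a few lines. This is a minimal illustration using SQLite, with an in-memory buffer standing in for the flat data file a real utility would write; the table and column names are hypothetical, and actual utilities such as SQL*Loader or BCP offer far richer format and performance options.

```python
import csv
import io
import sqlite3

# Source database with a small table to move (names are hypothetical).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
src.executemany("INSERT INTO employee VALUES (?, ?)",
                [(1, "Smith"), (2, "Jones")])

# "UNLOAD": read every row of the table out into a data file.
datafile = io.StringIO()
csv.writer(datafile).writerows(src.execute("SELECT id, name FROM employee"))

# "LOAD": populate a table in another database from that data file.
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
datafile.seek(0)
tgt.executemany("INSERT INTO employee VALUES (?, ?)", csv.reader(datafile))
```

Note that the target table must already exist before the load runs; the data file carries rows only, not the table definition.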

Although unloading data from one database and loading it to another is a relatively simple task, there are many factors that can complicate matters. For example, specifying the sometimes-intricate parameters required to unload and load the data in the correct format can be daunting, especially when the underlying source and target table definitions differ. Furthermore, unloading and loading is not very efficient for large tables, or even when many small-to-medium-sized tables need to be moved or refreshed.

Some DBMS products offer export and import utilities. Although similar to unload and load, import and export facilities typically work with more than just the data. For example, exported data may contain the schema for the table along with the data. In such cases, the import facility can create the table and import the data using just the export data file. Another difference is that an export file may contain more than just a single table. Nevertheless, export and import typically suffer from many of the same issues as unload and load, including inefficiency.
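The key difference from unload/load can be seen in a short sketch: the export bundles the table's schema (its CREATE statement) with the rows, so the import can build the table itself. This is a simplified illustration using SQLite, whose catalog exposes the DDL in `sqlite_master`; the table names are hypothetical.

```python
import sqlite3

# Source database holding the table to be exported (hypothetical names).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE dept (deptno INTEGER, dname TEXT)")
src.executemany("INSERT INTO dept VALUES (?, ?)",
                [(10, "SALES"), (20, "IT")])

# "Export": capture both the DDL and the data in one structure.
ddl = src.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'dept'"
).fetchone()[0]
rows = src.execute("SELECT deptno, dname FROM dept").fetchall()
export = {"schema": ddl, "rows": rows}

# "Import": create the table from the exported schema, then load the rows.
# No pre-existing table definition is needed on the target side.
tgt = sqlite3.connect(":memory:")
tgt.execute(export["schema"])
tgt.executemany("INSERT INTO dept VALUES (?, ?)", export["rows"])
```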

Another method for moving large quantities of data is ETL (extract, transform, and load) software, which is primarily used to populate data warehouses and data marts from other data sources. Although many DBMS vendors offer ETL software, it usually is not included with the base DBMS license. Using ETL software, the DBA can automate the extraction of data from disparate, heterogeneous sources.

For example, data may need to be extracted from legacy IMS databases and VSAM files from the mainframe; relational databases such as Oracle, SQL Server, and IBM Db2 on various platforms; spreadsheets stored on the LAN; as well as external data feeds. The ETL software can be set up to recognize and retrieve the data from these many different sources. Once retrieved, the data may need to be transformed in some fashion before it is sent to the target database. ETL software can be more flexible for complex data movement than simple unload/load or export/import utilities.
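The extract-transform-load pattern just described can be sketched in miniature. Here a CSV feed and a list of records stand in for two heterogeneous sources, a simple business rule serves as the transformation, and an SQLite table plays the target warehouse; the source layouts and the high-value rule are invented for illustration, and real ETL products add scheduling, lineage, and error handling on top of this flow.

```python
import csv
import io
import sqlite3

# Extract: pull rows from two dissimilar (simulated) sources --
# a CSV feed and a record list, standing in for, say, a flat file
# extract and a relational table. Field names are hypothetical.
csv_feed = "id,amount\n1,100\n2,250\n"
record_feed = [{"id": 3, "amount": "75"}]

rows = [(int(r["id"]), int(r["amount"]))
        for r in csv.DictReader(io.StringIO(csv_feed))]
rows += [(r["id"], int(r["amount"])) for r in record_feed]

# Transform: apply a simple (invented) business rule before loading --
# flag rows whose amount is 100 or more as high-value.
transformed = [(i, amt, 1 if amt >= 100 else 0) for i, amt in rows]

# Load: populate the target warehouse table with the unified rows.
dw = sqlite3.connect(":memory:")
dw.execute(
    "CREATE TABLE sales (id INTEGER, amount INTEGER, high_value INTEGER)")
dw.executemany("INSERT INTO sales VALUES (?, ?, ?)", transformed)
```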

Yet another method of moving data is through replication and propagation. When data is replicated, one data store is copied to one or more different data stores, either locally or at other locations. Replication can be implemented simply by copying entire tables to multiple locations or by copying a subset of the rows and/or columns, and it can be set up to automatically refresh the copied data on a regular basis.
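A full-refresh replication of a row-and-column subset can be sketched as follows, again using SQLite with hypothetical table names. The replica carries only the EAST-region rows and omits a sensitive column; a real replication product would run the refresh automatically on a schedule rather than on demand.

```python
import sqlite3

# Source table, including a column (ssn) that should not be replicated.
src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE customer (id INTEGER, name TEXT, region TEXT, ssn TEXT)")
src.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                [(1, "Ann", "EAST", "x"), (2, "Bob", "WEST", "y"),
                 (3, "Cal", "EAST", "z")])

# Replica holding only a subset of the columns.
replica = sqlite3.connect(":memory:")
replica.execute("CREATE TABLE customer_east (id INTEGER, name TEXT)")

def refresh():
    # Full refresh: discard the old copy, then re-copy only the EAST
    # rows, projecting away the sensitive ssn column.
    replica.execute("DELETE FROM customer_east")
    subset = src.execute(
        "SELECT id, name FROM customer WHERE region = 'EAST'").fetchall()
    replica.executemany("INSERT INTO customer_east VALUES (?, ?)", subset)

refresh()
```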

Propagation is the migration of only changed data. Propagation can be implemented by scanning the database transaction log and applying the results of data modification statements to another data store. Initial population of a data warehouse can be achieved by replication, and ongoing changes by propagation.
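Conceptually, propagation replays a stream of logged changes against the target, rather than re-copying the whole table. The sketch below simulates that with an invented log format (operation, key, value); reading a real transaction log is far more involved and product-specific.

```python
import sqlite3

# Simulated transaction log entries: (operation, key, new value).
# The format is invented for illustration; real logs are binary and
# DBMS-specific.
log = [
    ("INSERT", 1, "open"),
    ("INSERT", 2, "open"),
    ("UPDATE", 1, "closed"),
    ("DELETE", 2, None),
]

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE ticket (id INTEGER PRIMARY KEY, status TEXT)")

# Propagate: apply each logged change, in order, to the target copy.
for op, key, status in log:
    if op == "INSERT":
        tgt.execute("INSERT INTO ticket VALUES (?, ?)", (key, status))
    elif op == "UPDATE":
        tgt.execute("UPDATE ticket SET status = ? WHERE id = ?",
                    (status, key))
    elif op == "DELETE":
        tgt.execute("DELETE FROM ticket WHERE id = ?", (key,))
```

Because only the changes flow, the target converges on the source's state without the cost of a full refresh.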

Messaging software, also known as message queuing or application integration software, is another form of data movement. When using a message queue, data is placed onto the queue by one application or process; the data is read from the queue by another application or process. Messaging software works by providing APIs to read and write formatted messages to and from a queue. An application can read or write messages to and from the queue from any platform supported by the software.
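The write-to-queue/read-from-queue pattern can be sketched with Python's standard `queue` module and JSON-formatted messages. The `send`/`receive` helpers and the message layout are invented for illustration; commercial messaging products persist messages, span platforms, and survive process failures, which this in-process sketch does not.

```python
import json
import queue

# An in-process queue standing in for a messaging product's queue.
q = queue.Queue()

def send(q, payload):
    # Writer API: serialize the message into the agreed-upon format
    # (JSON here) and place it onto the queue.
    q.put(json.dumps(payload))

def receive(q):
    # Reader API: take the next message off the queue and decode it.
    return json.loads(q.get())

# One application writes messages...
send(q, {"order_id": 123, "action": "ship"})
send(q, {"order_id": 124, "action": "cancel"})

# ...and another reads them, in order, whenever it is ready.
first = receive(q)
```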

Of course, many other methods exist for moving data—from the simple, such as using a table editing tool to highlight and copy data, to the complex, such as writing programs to read the database and write to external files or directly to other databases. There are also vendor products that bypass database control to copy and move data at the file level, which typically outperform standard database utilities. Furthermore, some DBMSs provide additional built-in methods for copying and moving data, such as Oracle Transportable Tablespaces.

The bottom line is that data is constantly being moved from one place to another in most companies. Indeed, it is not hyperbole to say that a lot of CPU power is dedicated to moving data all over the place. And DBAs are constantly being asked to move more data—and to do it faster and more efficiently. DBAs should keep up with the data movement needs of their organization and deploy the proper techniques and products to keep that data moving effectively.