Designing for Batch Database Processing

Typically, most applications consist of both batch and online workloads. This is true even today, when most of our attention has turned to online and web-based interaction. Sure, online activities are the most obvious, in-your-face component of countless applications, but batch  processing still  drives many actions behind the scenes. This can include applying updates, processing reports, integrating input from multiple sources and locations, data extraction, database utility processing, and more.

Different types of database activities may be required to be tracked to ensure compliance. Typical categories of activity that need to be audited include the DDL (or Data Definition Language) for database structure changes, DML (or Data Manipulation Language) for data retrieval and modification, DCL (or Data Control Language) for authorization grants and revokes, security exceptions, and other types of access (such as database utilities that load and unload data).

So, with all of that in mind, let’s take a look at best practices for designing batch processes for database applications.

The first design concern for batch database programs is to ensure that database COMMITs are issued

within the program. Except for very trivial programs that access small amounts of data, database COMMITs should be issued periodically within a batch program to harden database modifications and release the locks held on the data. Failure to do so can cause problems. One such problem is reduced availability for concurrent programs when a large number of locks are being held.

Even worse, a large disruption in services can occur if the batch program fails before completing.

If a batch program with no COMMITs fails, all of the work that was “completed” before the failure must be rolled back. If this does not occur, the database will be in an inconsistent state when the problem is resolved and the batch program is resubmitted for processing. A batch program with COMMITs must be designed to be restartable. This means that the batch program has to be designed to keep track of its progress. The program needs to record its state when the last successful COMMIT was issued. In addition, logic must be coded to reposition all cursors to that point in the program when that last successful COMMIT was issued. When the program is run, it should check to see if it needs to reposition, which will be the case if it is a restart after a failure, or just run from the beginning (which will be the normal case). Only in the case of a restart must the program execute the repositioning logic before progressing.

Of course, database locking is a complex subject, and there are more details than we can adequately cover here. Each DBMS performs locking differently, and you will need to study the behavior of each DBMS you use to determine how best to set locking granularity and isolation levels, and to program to minimize timeouts and deadlocks. That said, failing to issue COMMITs is one of the biggest design problems with batch database programming.

Another problem that occurs frequently with batch database program development is the tendency for developers to think in terms of file processing, rather than database processing. This is especially true for programmers who have never worked with database systems. Application developers who are accustomed to processing data a record at a time will make very poor relational database programmers without some training in the set-at-a-time nature of accessing data in a relational database.

Each developer must be trained in the skills of database programming, including SQL skills, set-at-at-time processing, and database optimization. The responsibility for assuring that developers have these skills quite often falls on the DBA, but it should be an organizational imperative driven by IT management.

Finally, batch programs typically are scheduled to run at pre-determined times. The DBA should assist in batch database job scheduling to help minimize the load on the DBMS. Batch jobs that are long-running and resource-consuming should be scheduled during off-peak online transaction processing hours. The job of scheduling software can be complex and difficult to learn, but programmers and DBAs should make an attempt to understand the schedules in order to program and manage them effectively.