Database Statistics and Optimization

The optimizer is the heart and soul of a relational DBMS. It analyzes SQL statements and determines the most efficient access plan for satisfying each statement. The optimizer accomplishes this by parsing the SQL statement to determine which tables and columns must be accessed. It then queries system information and statistics stored in the system catalog and directory to determine the best method of accomplishing the tasks necessary to satisfy the request.

The optimizer is essentially an expert system for accessing database data. An expert system is a set of standard rules that when combined with situational data can return an expert opinion. For example, a medical expert system takes the set of rules determining which medication is useful for which illness, combines it with data describing the symptoms of ailments, and applies that knowledge base to a list of input symptoms. A relational optimizer renders expert opinions on data retrieval methods based on database statistics stored in the relational catalog and a query input in SQL format.

The notion of optimizing data access in the DBMS is a very powerful capability that today we take for granted. In the pre-relational days programmers had to explicitly code how to access each and every data request. But today access to relational data is achieved by telling the DBMS what to retrieve, not how to retrieve it. Regardless of how the data is physically stored and manipulated, SQL can be used to access that data. This separation of access criteria from physical storage characteristics is called physical data independence-and the optimizer is crucial in accomplishing this physical data independence.

If indexes are removed, you can still access the data (albeit less efficiently). If a column is added to the table being accessed, the data can still be manipulated without changing program code. This is all possible because the physical access paths to data are not coded by programmers in application programs, but are generated by the optimizer.

The optimizer performs complex calculations based on a host of information. To simplify the functionality of the optimizer, you can picture it as performing a basic four-step process:

  1. Receive and verify the syntax of the SQL statement.
  2. Analyze the environment and optimize the method of satisfying the SQL statement.
  3. Create machine-readable instructions to execute the optimized SQL.
  4. Execute the instructions or store them for future execution.

The second step of this process is the most intriguing. How does the optimizer decide how to execute the vast array of SQL statements that can be sent its way?

The optimizer has many strategies for optimizing SQL. How does it choose which of these strategies to use in the optimized access paths? DBMS vendors do not publish the actual, in-depth details of the inner-workings of their optimizers, but a good optimizer will be cost-based. This means that the optimizer will always attempt to formulate an access path for each query that reduces overall cost. To accomplish this, the optimizer will apply query cost formulas that evaluate and weigh multiple factors for each potential access path, such as CPU cost, I/O cost, statistical information in the system catalog, and the actual SQL statement.

The Importance of Statistics

Without the statistics stored in the system catalog, the optimizer would have a difficult time optimizing anything. These statistics provide the optimizer with information about the state of the tables that will be accessed by the SQL statement that is being optimized. The types of statistical information stored in the system catalog include:

  • Information about tables including the total number of rows, information about compression, and total number of pages.
  • Information about columns including number of discrete values for the column and the distribution range of values stored in the column.
  • Information about table spaces including the number of active pages.
  • Current status of the index including whether an index exists or not, the organization of the index (number of leaf pages and number of levels), the number of discrete values for the index key, and whether the index is clustered.
  • Information about the table space and partitions.

Statistics are gathered and stored in the system catalog when a utility (such as RUNSTATS or UPDATE STATISTICS) is executed. Be sure to work with your DBA to ensure that statistics are accumulated at the appropriate time, especially in a production environment.