Relational database management systems (RDBMS) rely on an optimizer (or relational optimizer) that transforms SQL statements into executable code. Before any SQL statement can be run by the RDBMS, the optimizer must first analyze the SQL and determine the most efficient access paths available for satisfying the statement. It accomplishes this by parsing the SQL statement to determine which tables and columns must be accessed. The optimizer next examines statistics stored in the system catalog to determine the best method of accomplishing the tasks necessary to satisfy the SQL request.
Statistics used by the optimizer include information about the current status of the tables, indexes, columns, and table spaces (including partitioning information) that need to be accessed. Examples of typical statistics stored in the system catalog include table (space) size, clustering details, number of rows, number of distinct values for columns, index levels, and so on. The optimizer plugs this information into a series of complex formulas that it uses as it builds optimized access paths.
In addition to the system catalog statistics, the optimizer will take into account other system information, such as the CPU being used, DDL options, and the amount of memory available. This allows the optimizer to estimate the number of rows that qualify for each predicate, and then use the proper algorithm to most efficiently access the required data.
The ability to optimize data access is a very powerful capability of relational database systems. Instead of arduously coding the exact instructions to retrieve the necessary data, access instead is achieved by telling the RDBMS what to retrieve and letting it figure out how to retrieve it. Regardless of how the data is physically stored and manipulated, Db2 and SQL can still access that data. This separation of access criteria from physical storage characteristics is called physical data independence, and it is only possible because of the Optimizer.
If you think about this for a moment, it is quite extraordinary. For years application programmers hard-coded access to data into their application programs. This means that the RDBMS can take over this task, at least when it comes to accessing database data. Moreover, when the amount and type of data in the database changes, the optimizer can formulate new and different access paths that conform to the changed data. And it can do so without involving a programmer writing a single line of code.
If indexes are removed, the RDBMS can still access the data (albeit less efficiently). If a column is added to the table being accessed, the data can still be manipulated without changing the program code. This is all possible because the physical access paths to the data are not coded by programmers in application programs but are generated by the relational optimizer.
Compare this with older, legacy data manipulation mechanisms (such as VSAM, IMS, and flat files), in which the programmer must know the physical structure of the data. If there is an index, the programmer must write appropriate code so that the index is used. If the index is removed, the program will not work unless changes are made. This is true also for some of the newer NoSQL databases, too. But with an RDBMS and SQL you get flexibility because of the capability to optimize data manipulation requests automatically.
Physical data independence is the primary reason that it is a best practice to put as much work as possible into SQL statements, instead of into your host program logic. When the data and its characteristics change, the optimizer can formulate another access path for the SQL “on the fly” if needed. If instead, host language code is used, programmers must rewrite that code to change access requirements.
The Optimizer is an Expert System
Essentially, the optimizer works like an expert system. An expert system is a set of standard rules that – when combined with relevant data – can return an expert opinion. For example, a medical expert system takes the set of rules determining which medication is useful for which illness, combines it with data describing the symptoms of ailments, and applies that knowledge base to a list of input symptoms. The optimizer renders expert opinions on data retrieval methods for SQL queries based on the relevant data stored in its system catalog.
The Bottom Line
The relational optimizer is a very complex component of the RDBMS that we too often take for granted. The optimization techniques of the major RDBMS products continue to be improved with every new release, too. Relational optimization has saved countless hours of work and as long as we use it properly and code our applications with knowledge of what optimization can do, the RDBMS can be used for a wide variety of requirements and use cases. Don’t lose sight of that as you wend your way through the hype out there regarding new types of database systems.