What is Large?

Every now and then some sage consultant will offer advice like "Large tables should be partitioned" or "Be sure to use static SQL for your applications with high volume transaction workloads." But how useful is this advice? What do they mean by large and high volume? Terms such as these are nebulous and ever changing. Just what is a large database today?

Of course, the standard answer of "it depends" applies - it depends on your site, your mixture of data types, and your particular requirements. But is that any more useful? Let's try to define the term and put some hard numbers around it.

First of all, this question begs to be asked: "Large in terms of what?" The first thing that springs to mind is the actual size of the database. But are we talking about a single table or an entire database? And sometimes we are even referring to the size of an entire subsystem or instance (such as for SAP implementations).

If we are at the table level, what is the granularity of measurement for determining what is large? Do we talk in terms of number of rows or number of pages (or blocks)? Or just the amount of disk space consumed? And do we count just the base data or add up the space used by indexes on that data as well?

And what about the type of data? Is a 20GB database consisting solely of traditional data (that is, numbers and characters; dates and times) bigger than a 50GB database that contains non-traditional BLOBs and CLOBs? From a purely physical perspective the answer is obvious, but from a management perspective the answer is more nebulous. It may indeed be more difficult to administer the 20GB database of traditional data than the 50 GB database of large objects because traditional data consists of more disparate attributes and is likely to change more frequently.

Another issue is just what are we counting when we say we have a large database? Do we count copied and denormalized data? And what about free space? There are two schools of thought: one says if it is in a database file, then it counts. Another says, let's only count the core data. From the perspective of the DBA though, you have to count everything that needs to be managed - and doesn't everything need to be managed?

Some interesting data points can be found in the report "Managing the Rapid Rise in Database Growth: 2011 IOUG Survey on Database Manageability," produced by Unisphere Research and sponsored by Oracle. According to this study, "More than one-third of companies in the survey report their data stores are expanding at a rate greater than 20 percent a year." But this tells us what we already know, that we are storing more and more data all the time. It doesn't really tell us what is "large."

Winter Corp. used to conduct an annual study of the largest production databases, but they have not produced a new study in several years now. The most recent was produced in 2003, at which point Winter reported the largest database to consist of 828TB of data. In terms of workload, 51,448 transactions per second (tps), was the highest figure. And the largest number of rows for a database was 496 billion rows. The study also tracked normalized data versus denormalized data, and the largest database of normalized data was 94.3TB in size. Indeed, these are all very large - and likely well beyond the scope of what a "normal" IT environment would call large.

So you need to be prepared with the criteria for what establishes database "largeness" at your shop. Is it a management issue? A planning issue?  It better be both of those, but sometimes it is a braggadocio issue, too!  You know, being able to say "My database can beat up your database."

Furthermore, the granularity of the object being discussed is important, too. Although the Winter Corp. research is at the database level, most DBAs will be more interested in managing at the table or table space level. So what is a large table? Well, for DB2, the DBMS I know best, a good place to start is probably 4GB. To specify a value greater than 4GB, the data sets for the table space must be associated with a DFSMS data class defined with extended format and extended addressability. DFSMS's extended addressability function is necessary to create data sets larger than 4GB in size. Of course, depending on your shop and its requirements this might be too high ... or even too low.

So, here we are, near the end of this column and we seem to have more questions than answers. So, how about some advice? First of all, when determining what a large table is for your shop, do it in terms of the number of pages, not the number of rows. You can use this number to easily compare the size of one table space to another, whereas you cannot if using number of rows because row size can vary dramatically from table to table. And count everything that is being persistently stored in the database: data, indexes, free space, etc. If it is being stored it must be managed, and therefore impacts TCO. Stripping out everything but normalized data only matters when you are worrying about who has the biggest database, and we should be more worried about assuring the availability and manageability of our big databases!

The Bottom Line

One thing can be said for sure, though - and that is this: Our databases are getting bigger. These days we talk more frequently in terms of terabytes than gigabytes, and soon we will be comfortable talking about petabytes. Who said life as a DBA was boring?