Systems today need to move fast to accommodate users who demand instant access. Data, both structured (parsed in a database) and unstructured (video and audio files) have exploded.
Unfortunately, more data also means it might take longer to find actually useful data. Your once-snappy database is now slogging through terabytes of data to fulfill the simplest requests.
As the industry embraces artificial intelligence and marches toward autonomous systems, some enterprises will still need to rely on traditional databases that must be managed, tuned, and secured to ensure the best performance. And while not all organizations will be able to make the autonomous transition quickly due to internal processes, there are a few tricks to optimize your old database with terabytes of data at speeds needed for everything from e-commerce to IoT and music streaming services on a global scale.
Of course, you’re likely already spending a few hours indexing your databases, but here are some other ways you can configure them to increase their speed. In order to illustrate these points, we’ll take a trip to the grocery store.
Design for Speed: Space Management
Step one: Plan ahead. An important aspect for loading—and querying—large amounts of data is the space management used for the underlying tables. The goal is to load data as fast and efficiently as possible while guaranteeing the physical data storage is not going to be detrimental to the future data access.
Step two: Extend space in big chunks, not just the incremental piece you need for the next task. There is an overhead cost with each space allocation which can impact performance, and unused space is often cheap. In the long run, it’s more cost effective to extend your space for the next two or three tasks at one time.
You can think of this in the same way as you would plan for a large weekly shop at a grocery store. When you walk through the door, you select a cart instead of a hand basket. A cart will ensure there is plenty of room for everything you need to purchase, and you will be easily able to access all of your items when you get to checkout.
Data Ingestion: Array Inserts
You can think of data ingestion as being analogous to putting items in your cart and paying for them all at once. You would never select one item at a time and pay for it before moving on to the next item on your list (which is analogous to a single row insert followed by a commit).
You walk through the store, collect all of the items on your list, and go to the checkout once you’re done (which is analogous to an array insert followed by a commit). The same is true when you want to insert data efficiently into the database.
So, take advantage of array inserts and commit only after each array insert.
Where you stage the data to be loaded is critical for maximizing the Input/Output (IO) throughput of a large data load. Because no matter what else you do, the IO throughput of the shared storage has a direct impact on the load speed, as data can never be loaded faster than it can be read.
Direct path loads typically use external tables, which enable external data (files) to be visible within the database as a virtual table that can be queried directly and in parallel without requiring the external data to be first loaded in the database.
Wherever you are staging the data, the bandwidth of that is what will control the speed of the load much more than the database itself. If the network can’t read any faster than 20 MBs/second, you can’t load any faster than 20 MBs/second.
Doubling Up: Parallel Execution
Parallel execution is a commonly used method of speeding up operations by splitting them into smaller sub tasks and assigning each to a different parallel process. Just as it’s more fun to go shopping if you bring helpers and split a large shopping list, you can speed up data IO by having multiple processes do the work.
Compressing Data Matters
Consider compressing data akin to organizing items in your cart. You can fit a lot more groceries if you spend a little time organizing them rather than just throwing them in randomly. Similarly, a little database organization goes a long way.
Loading data at scale always begs the question of whether or not to compress the data during the data load. At scale (meaning either billions of rows, or hundreds of gigs to multiple terabytes), the benefits of data compression are two-fold. The first obvious advantage is that you’re reducing the volume of space needed for storage, which can add up in costs at scale.
The second advantage is performance improvement. Because compressed data can improve query performance. But, since nothing in life is free, there is a cost in the additional CPU needed for the compression step during the data load. Will it be worth it to you? It’s a simple cost/benefit ratio. Assuming your workload is injected once and queried many times, the tradeoff is totally worth it.
Partitioning: Managing Terabytes of Data
Just as a grocery store is divided into different aisles, partitioning allows a table, index or index-organized table to be subdivided into smaller pieces. Each piece is called a partition and has its own name and storage characteristics. One of the key advantages of partitioning lies in its ability to greatly reduce the volume of data needed to answer a query.
So, if you’re going to the grocery store for a carton of ice cream, you head directly to the freezer section instead of walking up and down every aisle until you reach the ice cream. Partitioning can provide tremendous benefits to an ingest-heavy workload, including improving manageability, availability, and performance by reading only the data you need.
For example, a large utility company collects billions of rows of customer data via meters and I0T sensors. Monitoring has shown that once monthly bills are sent, the chances of it being queried are tiny. Of course, they need to keep it for regulatory reasons, so after three months, they throw it into a single partition and apply a higher level of compression to it, making it much cheaper to store.
Ready for the Store of the Future
These simple techniques can turn your database into a lean, mean, speed machine. Of course, similarly to how you can also order groceries online to avoid the grocery store completely, autonomous systems will eliminate the hassle of monitoring, scaling, and even updating through a couple of mouse clicks. Until then, these tips should hold you over until you’re ready for the store of the future.