I can’t remember the first time a database or storage vendor told me, “Disk is cheap,” but it was probably in the 1990s. Vendors like to say disk is cheap because it helps them sell more of it and encourages bigger deployments. The fact is that data storage is getting cheaper all the time. When I entered the business, one GB of storage cost about $1000. Today, it’s more like 10 cents—10,000 times less!
On the other hand, the amount of data enterprises store has increased just as dramatically. And no matter how cheap storage might be compared to the past, the simple fact is that the cost of storage is a big part of running a database—especially a cloud database where you pay for the storage, not just once, but every month.
There’s always been a trade-off between performance and cost. In the days of spinning disks, we’d often configure more disks than we needed for the data so that we could get greater performance from multiple devices. In the cloud era, we can choose between super high-speed SSD storage and super low-cost archival storage. On AWS, the fastest possible storage might cost 1,000 times more than the cheapest possible storage.
In databases, we often have a mix of “hot” and “cold” data. Hot data is accessed frequently, and performance is often critical. Cold data is accessed infrequently, and it might not be a concern if retrieval times are slow. Therefore, many databases offer a “data tiering” solution that allows you to place hot data on high-speed, expensive devices and cold data on slower, cheaper storage.
MongoDB Atlas Online Archive is MongoDB’s data tiering solution.
Online Archive builds off MongoDB Data Lake technology, which we first looked at in 2019. The Data Lake is essentially a MongoDB query layer on top of raw JSON files stored in AWS S3 or Azure Blob Storage. You can query these objects as though they were in a MongoDB database. However, you can’t update them, and the query performance is dramatically degraded compared with a native MongoDB cluster.
Online Archive leverages Data Lake technology by automatically migrating old data from a MongoDB cluster to a Data Lake and providing a data federation layer on top of the two, which allows queries to combine data from both sources. The result is a MongoDB cluster in which data “ages” out of expensive database storage to lower cost cloud storage.
Data to be archived is identified either by a date attribute, which allows for simple “archive after X months” logic, or a custom query that can implement more specific logic. This logic is implemented on a per-collection basis.
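To make the two styles of archiving logic concrete, here is a minimal sketch in Python of what the criteria look like as MongoDB-style query documents. The collection and field names ("orderDate", "status") are hypothetical examples, not part of any real schema, and the exact configuration format Atlas accepts will differ from these raw query documents.

```python
from datetime import datetime, timedelta, timezone

# Date-based rule: archive documents whose (hypothetical) orderDate
# field is older than 180 days -- the simple "archive after X months" case.
cutoff = datetime.now(timezone.utc) - timedelta(days=180)
date_rule = {"orderDate": {"$lt": cutoff}}

# Custom-query rule: archive only closed orders older than the cutoff,
# so open orders stay in the live cluster regardless of age.
custom_rule = {
    "status": "closed",
    "orderDate": {"$lt": cutoff},
}

print(sorted(custom_rule))  # -> ['orderDate', 'status']
```

Either style of rule is attached to a single collection; a cluster with several collections to age out needs a rule per collection.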
Archived data comes with some definite limitations: it can’t be updated, it’s slower to retrieve, and it can’t be indexed. There is, however, the ability to partition archive files by attributes so that (for instance) you could ensure that all the archived data for individual customers ends up in separate buckets. That ensures that queries looking for a specific customer don’t have to scan buckets belonging to other customers.
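The effect of partitioning can be sketched as follows: each distinct value of the partition attribute maps to its own storage prefix, so a query filtered on that attribute only scans one prefix. The path layout below is purely illustrative—it is not Atlas’s actual object-naming scheme—and "customerId" is a hypothetical field.

```python
def partition_path(doc, partition_field):
    """Return the storage prefix an archived document would land under."""
    return f"archive/{partition_field}={doc[partition_field]}/"

docs = [
    {"customerId": "acme", "total": 120},
    {"customerId": "globex", "total": 75},
    {"customerId": "acme", "total": 42},
]

# A query for one customer only needs to scan that customer's prefix.
prefixes = {partition_path(d, "customerId") for d in docs}
print(sorted(prefixes))
# -> ['archive/customerId=acme/', 'archive/customerId=globex/']
```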
Online Archive is not entirely transparent to the application. Once an archive is created, you will have three connection strings. One connection string allows you read/write access to the core database; you need to use this connection string to perform inserts, updates, and deletes. Another connection string gives you read-only access to a combination of online and archived data—you’d use this one for reads that need to combine the two. The third connection string allows read access to online data only.
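One way to manage the three connection strings in application code is a small routing helper: writes always go to the cluster, and reads pick a string depending on whether they need archived data. This is a sketch with placeholder URIs—in a real application each URI would come from the Atlas console and be passed to a pymongo MongoClient.

```python
# Placeholder URIs standing in for the three Atlas connection strings.
CLUSTER_URI = "mongodb+srv://cluster.example.net"     # read/write, live data
FEDERATED_URI = "mongodb://federated.example.net"     # read-only, live + archive
ONLINE_READ_URI = "mongodb://online-ro.example.net"   # read-only, live data only

def pick_uri(operation, needs_archived=False):
    """Choose the connection string appropriate for an operation."""
    if operation in {"insert", "update", "delete"}:
        return CLUSTER_URI       # writes must go to the live cluster
    if needs_archived:
        return FEDERATED_URI     # reads spanning live and archived data
    return ONLINE_READ_URI       # reads over live data only

print(pick_uri("update"))                      # -> CLUSTER_URI
print(pick_uri("find", needs_archived=True))   # -> FEDERATED_URI
```

The key point is that the application, not the database, decides which view of the data each query should see.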
Cost savings from the use of Online Archive will vary depending on the amount of data archived, but in one cited case, a MongoDB customer achieved a 60% reduction in data storage costs. Storage may be cheap, but Atlas Online Archive makes it even more affordable.