Inside MongoDB Time-Series Collections

One of the big-ticket items at the recent MongoDB 5.0 launch was the introduction of specialized “time-series” collections. 

Time-series databases (TSDBs) have long been a specialized niche in the database landscape. The key concept in a time-series database is that, for a lot of data, the timestamp of the data is a critical element, both in terms of operational cost and analytic value. As predictive analytics has become a more widely used tool in business, the need to measure how a data item changes over time becomes more important. At the same time, the increasing importance of sensor data—the Internet of Things—has resulted in an explosion of timestamped information often being generated at very high frequencies.

Relational databases were not particularly well optimized for the sort of high-speed time-series traffic generated by IoT. “Wide column” NoSQL databases such as Cassandra and HBase were able to accept higher rates of input based on their write-optimized distributed architectures. Dedicated time-series databases such as OpenTSDB do exist, but these are not currently widely adopted.

MongoDB is broadly deployed across a wide range of industries and has been put to use to store time-series-oriented data on many occasions. However, as with most databases, a bottleneck is eventually reached on write traffic, and the cost of storage can also be problematic. Aggregate queries on fine-grained time-series data can also be extremely resource-intensive.

MongoDB 4.0 customers would often work around these limitations by adopting a custom data model that reduced the IoT/time-series overhead. Typically, some middleware layer would append multiple measurements into a single JSON document (sometimes called a “bucket”) and insert these in a single write operation. While this helped with insert overhead, it also created an awkward programming model and made analytic queries cumbersome and unnatural.
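The manual bucketing workaround can be sketched roughly as follows. This is an illustrative sketch only: the field names (sensor_id, timestamp, value, readings), the one-hour bucket size, and the bucket_measurements helper are all assumptions made for the example, not a layout prescribed by MongoDB.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def bucket_measurements(measurements, bucket_seconds=3600):
    """Group raw readings into one document per sensor per time window.

    Hypothetical middleware helper: each returned dict stands in for the
    single "bucket" document that would be written in one insert.
    """
    buckets = defaultdict(list)
    for m in measurements:
        epoch = int(m["timestamp"].timestamp())
        # Round the timestamp down to the start of its bucket.
        bucket_start = epoch - (epoch % bucket_seconds)
        buckets[(m["sensor_id"], bucket_start)].append(m)

    docs = []
    for (sensor_id, bucket_start), readings in sorted(buckets.items()):
        readings.sort(key=lambda r: r["timestamp"])
        docs.append({
            "sensor_id": sensor_id,
            "bucket_start": datetime.fromtimestamp(bucket_start, tz=timezone.utc),
            "count": len(readings),
            "readings": [
                {"timestamp": r["timestamp"], "value": r["value"]}
                for r in readings
            ],
        })
    return docs


# Six readings taken 20 minutes apart, starting on the hour.
base = datetime(2021, 7, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    {"sensor_id": "a", "timestamp": base + timedelta(minutes=20 * i), "value": i}
    for i in range(6)
]
docs = bucket_measurements(sample)
```

Six individual writes collapse into two bucket documents here, which is where the insert savings come from. The cost is visible too: any query for a single reading now has to unpack the nested readings array.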

The aim of MongoDB 5.0 time-series collections is to provide the advantages of a bucketed data model while still providing familiar and simple programming patterns.  

Superficially, a MongoDB time-series collection looks like any other collection, although you must specify a timestamp field when you create it, and you can also supply a “granularity” setting that describes how frequently data is expected to arrive. Additionally, one can specify a metadata field that can be indexed for query purposes. Finally, data in a time-series collection can be automatically purged after a specified amount of time.
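As a rough sketch of what those settings look like in practice, the helper below assembles the options that the create command accepts for a time-series collection (timeseries.timeField, timeseries.metaField, timeseries.granularity, and expireAfterSeconds). The helper names, the collection name, and the field names are assumptions for illustration. The db argument is assumed to be a PyMongo Database object connected to a MongoDB 5.0 or later server.

```python
def timeseries_options(time_field, meta_field=None, granularity="seconds",
                       expire_after_seconds=None):
    """Build keyword options for creating a time-series collection."""
    if granularity not in ("seconds", "minutes", "hours"):
        raise ValueError("granularity must be 'seconds', 'minutes' or 'hours'")

    ts = {"timeField": time_field, "granularity": granularity}
    if meta_field is not None:
        ts["metaField"] = meta_field

    options = {"timeseries": ts}
    if expire_after_seconds is not None:
        # Documents older than this many seconds are purged automatically.
        options["expireAfterSeconds"] = expire_after_seconds
    return options


def create_timeseries_collection(db, name, **kwargs):
    """Create the collection on `db` (assumed to be a pymongo Database)."""
    return db.create_collection(name, **timeseries_options(**kwargs))


# Example options: one reading per sensor roughly every minute, kept for 30 days.
options = timeseries_options(
    "timestamp",
    meta_field="sensor",
    granularity="minutes",
    expire_after_seconds=30 * 24 * 3600,
)

# Against a live server (not executed here), usage might look like:
#   from pymongo import MongoClient
#   db = MongoClient()["iot"]
#   create_timeseries_collection(db, "readings", time_field="timestamp",
#                                meta_field="sensor", granularity="minutes")
```

Once created, the collection is written to and queried with ordinary insert and find calls. The bucketing is not visible to the application.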

Under the hood, MongoDB stores the data in buckets aligned as much as possible to the granularity setting. This reduces the number of write operations required and stores the data in a more compact format. Each bucket also records the minimum and maximum values of the measurements it contains, which can be used to accelerate certain classes of aggregate operations.
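To see why per-bucket minimum and maximum values help, consider the simplified model below. It is a conceptual sketch, not MongoDB's actual storage format or query planner: the summarize_bucket, global_max, and buckets_to_scan helpers and the dictionary layout are hypothetical.

```python
def summarize_bucket(values):
    """Store the raw values alongside a precomputed min/max summary."""
    return {"min": min(values), "max": max(values), "values": list(values)}


def global_max(buckets):
    """Answer a max() query from the summaries alone, without reading values."""
    return max(b["max"] for b in buckets)


def buckets_to_scan(buckets, threshold):
    """Keep only buckets that could contain a value above `threshold`.

    A bucket whose max is at or below the threshold cannot match, so its
    individual measurements never need to be unpacked.
    """
    return [b for b in buckets if b["max"] > threshold]


buckets = [
    summarize_bucket([18.2, 18.9, 19.4]),
    summarize_bucket([21.0, 24.7, 23.1]),
    summarize_bucket([17.5, 17.9, 18.0]),
]

hottest = global_max(buckets)
candidates = buckets_to_scan(buckets, threshold=20.0)
```

In this example the maximum is found by looking at three summary values rather than nine measurements, and a search for readings above 20.0 only needs to open one of the three buckets.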

MongoDB 5.0 also introduced a new aggregation “windowing” stage. Window or “analytic” functions are one of the most powerful yet hard-to-learn parts of the SQL language. Window functions partition the rows in a result set and create a sort of “virtual table” that the function works with. The function operates on a “window” of rows around the current row, allowing you to access trends or group information. The MongoDB $setWindowFields aggregation stage implements functionality corresponding to the SQL window functions.

While window functions in MongoDB 5.0 can be used against any collection, they are particularly useful in the time-series context since they can be used for purposes such as moving averages and other time-oriented analytics. 
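A moving average is probably the simplest illustration. The sketch below builds a $setWindowFields pipeline as plain Python dictionaries, and pairs it with a small pure-Python function that mimics what the window computes so the behaviour can be checked without a server. The field names (sensor, timestamp, temperature), the output field movingAvg, and both helper functions are assumptions for the example.

```python
def moving_average_pipeline(partition_field, time_field, value_field,
                            window_size=3):
    """Build an aggregation pipeline computing a trailing moving average."""
    return [{
        "$setWindowFields": {
            # One independent window sequence per partition (e.g. per sensor).
            "partitionBy": f"${partition_field}",
            "sortBy": {time_field: 1},
            "output": {
                "movingAvg": {
                    "$avg": f"${value_field}",
                    # Current document plus the (window_size - 1) before it.
                    "window": {"documents": [-(window_size - 1), 0]},
                }
            },
        }
    }]


def moving_average(values, window_size=3):
    """Pure-Python equivalent of the trailing window for one partition."""
    result = []
    for i in range(len(values)):
        window = values[max(0, i - window_size + 1): i + 1]
        result.append(sum(window) / len(window))
    return result


pipeline = moving_average_pipeline("sensor", "timestamp", "temperature")
averages = moving_average([10, 20, 30, 40])

# Against a live server (not executed here), the pipeline would be passed to
# a pymongo collection, for example: db["readings"].aggregate(pipeline)
```

The window shrinks at the start of each partition, so the first values are averaged over fewer documents. The pure-Python version reproduces that by clipping the slice at index 0.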

Currently, time-series collections are append-only, although you can update the underlying storage directly if you are feeling brave. There are also limitations in the initial release around indexing and the availability of some advanced features such as change streams. These limitations are expected to be lifted in future releases.

Time-series collections represent another attempt by the MongoDB company to extend the use of MongoDB to a wider set of applications and scenarios. They are a welcome addition.