Every database platform is architected for its initial use case. The progression over the decades of new database platforms with new capabilities has been driven by more than just the changing landscape of computer hardware. It also reflects the increasing diversity and complexity of use cases for database platforms, combined with an accelerating operational tempo for business and an insatiable appetite for scale and performance in IT.
The emergence of the Internet of Things is the driving force behind the latest evolutionary change in the way we think about database platforms and the most important change since the development of the original “big data” technologies a decade ago. The Internet of Things is the collective term for machines, sensors, and mobile platforms that autonomously measure and analyze their environment. It generates data as diverse as the location of mobile phones, weather patterns over a country, sensor readings from jet engines, the temperature in your home, and satellite images of Earth. The process of continuously capturing and analyzing these data sources contextualizes the real-world environments in which events occur and enables insights into the observed behaviors.
The plummeting cost of sensor platforms is creating a world where organizations can have ubiquitous operational context at their fingertips. The ability to contextualize almost any operational scenario with live sensor data, immediately derive insights from that data, and make critical business decisions from this understanding of the world is unprecedented.
Leveraging data from the Internet of Things is expected to create trillions of dollars in new value across the global economy every year. Second only to the profound impact the Internet of Things will have on business is the impact it will have on the database and analytics infrastructure that manages this tsunami of data.
All of this new data has attributes of “where” and “when.” This is not unique to the Internet of Things; these attributes are inherent in that all data is ultimately generated by events that happen in the real world. A point-of-sale transaction happens at a place and time, a phone call connects two places at a point in time, and even events that occur within the virtual confines of the internet, such as posting content on Facebook, are ultimately created somewhere in reality at a particular moment.
However, while the attributes of “where” and “when” can provide valuable context for many applications, they have rarely been essential to extracting value from most enterprise or internet data models. Until now, there has been little demand for large-scale database platforms optimized for managing and analyzing spatially related sensor and location data.
The Internet of Things, on the other hand, is specifically about enabling immediate analysis of the real world for a wide range of purposes, from understanding consumer behavior to optimizing oil and gas production. Analysis of this data necessarily reflects the intrinsically spatial nature of the world we live in. The “where” and “when” are not just coincidental attributes of the data that might be stored in the database; they are the primary attributes that drive high-value applications. Contextualization of events is unavoidably tied to locality.
The weather in Los Angeles is unlikely to influence commuter traffic in San Francisco or social media sentiment in New York, but in Los Angeles, it will invariably influence both.
Analyzed in this manner, the Internet of Things will allow us to make sense of the world around us.
Whether it is tracking the flow of populations and commerce throughout a city in order to understand complex consumer behaviors, using remote sensing platforms to detect risks in agricultural supply chains, or combining aircraft sensor data with atmospheric measurements to optimize the fuel economy of flights in real time, the ability to immediately measure and respond to changes in the environment will create incredible value.
The challenge for these applications is selecting data infrastructure that can fully take advantage of Internet of Things data. A long-standing principle of database design and architecture is that you organize your database internals in order to optimize your most important workloads, data relationships, and queries.
The selected optimizations are, in a sense, trade-offs that sacrifice suitability of the platform for applications far outside the intended use case. We have many database platforms highly optimized for traditional enterprise data models, and we have many big data platforms that are optimized for the way web content and social networks are organized.
To fully leverage the Internet of Things, though, you need big data platforms that are organized and optimized for the way reality is organized. In other words, you need a big data platform that is designed for data models in which the most important attributes of every record are “where” and “when.” Internet of Things applications have multiple characteristics that are broadly unprecedented in the database world. Together, these characteristics are driving the development of a new class of large-scale analytical database platforms targeted at spatial data models.
There are three essential ingredients to building high-quality Internet of Things applications:
1. High-Velocity Data Ingestion
For as much data as humans generate, we have significant limits. We can only buy so many things and click so many web links every day. In fact, if you look at any global human-generated data source, whether text messages, tweets, or credit card transactions, the average data rate for the entire human race typically tops out at hundreds of thousands of records per second. Most of our database and big data platforms were optimized for human-generated data rates because historically this was, with few exceptions, the only type of data we captured. By contrast, machine-generated data sources common for the Internet of Things continuously create new records at rates of millions to billions of records per second.
At these data rates, the volume of data that must be stored for analysis is so large that keeping all of it in memory for speed is not an economical option. A petabyte per day is not unusual. Few database platforms were designed to parse, process, index, and store millions of records per second on a continuous basis, particularly when the data is spatial in nature.
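To put a petabyte per day in perspective, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: it assumes a decimal petabyte, and the record rate and record size in the second calculation are hypothetical values chosen for the example, not measurements from any particular data source.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETABYTE = 10**15  # decimal petabyte, in bytes (an assumption of this sketch)


def sustained_bytes_per_second(bytes_per_day):
    """Average write rate needed to absorb a given daily volume."""
    return bytes_per_day / SECONDS_PER_DAY


def daily_volume_bytes(records_per_second, bytes_per_record):
    """Daily volume produced by a continuous stream of fixed-size records."""
    return records_per_second * bytes_per_record * SECONDS_PER_DAY


# One petabyte per day, expressed as a sustained rate.
rate = sustained_bytes_per_second(PETABYTE)
print(f"1 PB/day is about {rate / 1e9:.2f} GB/s sustained")

# Hypothetical stream: 10 million records per second, 1,000 bytes each.
volume = daily_volume_bytes(10_000_000, 1_000)
print(f"That stream produces about {volume / PETABYTE:.3f} PB/day")
```

Under these assumptions, a petabyte per day works out to more than 11 gigabytes written every second around the clock, and a stream of 10 million kilobyte-sized records per second gets most of the way there on its own.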
2. Spatial Data Models and Analytics
In databases, you achieve scalability by organizing your data around the questions you intend to ask. For the Internet of Things, both the data and the questions are spatial in nature. Your core data types are things such as vectors that capture human motion in cities or complex polygons from real-time weather sensors. To understand how one type of event influences another, such as weather on human behavior, you do spatial joins across these various sources.
Few platforms are purpose-built for spatially organized data models. Many enterprise databases offer spatial options, but these were designed for GIS use cases with small data sets that change slowly. A typical Internet of Things data source often creates multiple orders of magnitude more data each day than these spatial indexes were ever designed to support. Popular big data platforms such as Hadoop and Spark were never architected to support the requirements of spatial data models or spatial analytics as that is far outside their intended use case.
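As a rough illustration of what a spatial join involves, the Python sketch below pairs point observations with the polygons that contain them. The function names, the region and vehicle identifiers, and the coordinates are all hypothetical. The nested loop with a bounding-box pre-check is deliberately naive and is not how a purpose-built platform would do it; at Internet of Things volumes the same join would need a real spatial index.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    The polygon is a list of (x, y) vertices in order.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bounding_box(polygon):
    """Axis-aligned box (min_x, min_y, max_x, max_y) around the polygon."""
    xs = [px for px, _ in polygon]
    ys = [py for _, py in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def spatial_join(points, regions):
    """Return (point_id, region_name) for every point inside a region."""
    boxes = {name: bounding_box(poly) for name, poly in regions.items()}
    matches = []
    for point_id, (x, y) in points.items():
        for name, poly in regions.items():
            min_x, min_y, max_x, max_y = boxes[name]
            # Cheap rejection before the exact containment test.
            if not (min_x <= x <= max_x and min_y <= y <= max_y):
                continue
            if point_in_polygon(x, y, poly):
                matches.append((point_id, name))
    return matches


# Hypothetical weather polygons and vehicle positions.
regions = {
    "storm_cell": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "heat_zone": [(5, 5), (9, 5), (7, 9)],
}
points = {"vehicle_a": (1, 1), "vehicle_b": (7, 6), "vehicle_c": (10, 10)}

print(spatial_join(points, regions))
```

Every point is compared against every region here, so the work grows with the product of the two data sets. That scaling is the reason spatial indexes designed for small, slowly changing data struggle with sources of this size.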
3. Real-Time Query Execution
Internet of Things applications are often for an operational environment where the realizable value from data is highly perishable. What is actionable information now may no longer be useful in 5 minutes or even 5 seconds. Consequently, the ability to make immediate decisions based on the freshest possible data is critical to maximizing the value of these data models.
The term “real time” is one of the most abused in big data. It does not mean quick access to stale data. It does not mean immediate storage of data that will be analyzed with a slow batch process. Real time is about operating at the speed of reality. It is characterized by the amount of time that passes between data being available for ingestion and that data being reflected in an application.
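That definition can be written down as a simple measurement. The Python sketch below is one way to express it, under stated assumptions: the class name, field names, timestamps, and the 5-second budget are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordTiming:
    available_ms: int  # when the record became available for ingestion
    visible_ms: int    # when the record was first reflected in the application


def freshness_latency_ms(timing):
    """Time between availability for ingestion and visibility to the application."""
    return timing.visible_ms - timing.available_ms


def within_budget(timings, budget_ms):
    """True only if every record became visible within the latency budget."""
    return all(freshness_latency_ms(t) <= budget_ms for t in timings)


# Hypothetical timings: one record served by a streaming path,
# one stored immediately but only analyzed by an hourly batch job.
streamed = RecordTiming(available_ms=100_000, visible_ms=100_800)
batched = RecordTiming(available_ms=100_000, visible_ms=3_700_000)

BUDGET_MS = 5_000  # hypothetical 5-second budget

print(freshness_latency_ms(streamed), "ms on the streaming path")
print(freshness_latency_ms(batched), "ms on the batch path")
print("all within budget:", within_budget([streamed, batched], BUDGET_MS))
```

By this measure the batch record is not real time even though it was stored immediately, because what counts is when the data shows up in the application.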
Many big data systems have a fast query capability but only if the data is static, slow moving, fits in memory, or can be summarized. Real-time query execution on fast-moving data is achievable but it requires an architecture purpose-built for that workload. At the confluence of these three architectural threads are real-time spatial data platforms perfectly suited to the rich scope of applications that can leverage Internet of Things-generated data.
Spatial Databases Will Become Increasingly Important
As contextual computing and the use of live sensor data grow, spatial databases capable of supporting the creation of real-time operational data models will become an increasingly important part of enterprise infrastructures, whether in the cloud or on premises. These new platforms will replace neither the traditional enterprise database nor the existing generation of big data platforms such as Hadoop; those platforms fill a valuable role in the enterprise.
Instead, platforms designed to handle the spatial workloads of Internet of Things applications will become a new and essential component of every enterprise infrastructure.