Modern Analytics for the IoT Age

There is no doubt that the Internet of Things (IoT) is increasingly transforming the way we work and live. Gartner predicts that the number of connected things worldwide will jump from 6.4 billion in 2016—with some 5.5 million new things getting connected every day—to reach 20.8 billion by 2020. IDC is even more bullish, projecting that 32 billion IoT devices will be in use by 2020. And all those billions of IoT devices are generating a veritable tsunami of data.

Consider the 2014 Distributed Event-Based Systems (DEBS) Grand Challenge, in which just 40 houses with 2,000 sensors generated roughly 6 billion events in 4 months. If you were to increase the number of sensors a thousandfold to 2 million, you would end up with some 6 trillion events in the same timeframe. In that context, IDC’s prediction that 32 billion IoT devices will generate a massive 44 zettabytes of data in 2020 sounds quite plausible.

IoT-driven solutions hold tremendous promise for improving customer relations, operational efficiency, health, safety, and privacy. And the greatest power lies in using IoT-derived insights to respond to opportunities or threats immediately. However, enterprises largely have focused on historical reporting and will need to significantly modernize their analytics capabilities—both in understanding current events and predicting future outcomes—to take advantage of the new insights that IoT data can bring.

The following are 5 characteristics of modern analytics optimized for IoT environments:

1-IoT Analytics Are Distributed

Most enterprise IoT environments are like spider webs: distributed structures connecting a myriad of sensors, gateways, and collection points, with data forming the threads that bind them together. This places a couple of key demands on analytics platforms.

The first is communications. IoT structures often utilize low-energy and wireless networks, so analytics platforms need to support the flow of data across a range of protocols. Four of these are particularly important. Message Queuing Telemetry Transport (MQTT) is a highly efficient and lightweight protocol that has gained a lot of market traction in IoT implementations. The Constrained Application Protocol (CoAP), an alternative to MQTT, is widely adopted for machine-to-machine communications. Zigbee is a very energy-efficient protocol that is used to make connections between IoT devices and the hub. Best known is Bluetooth low energy (BLE), given the broad use of Bluetooth-enabled consumer devices.

Second, the dynamic quality of IoT implementations means analytics solutions should have the flexibility to expand or contract to match the load. Deploying analytics in the cloud is one option. However, many IoT deployments have on-premises aspects, such as machines on the factory floor or kiosks in stores. Therefore, an IoT analytics solution may need to scale across a hybrid environment leveraging both the cloud and on-premises systems.

Additionally, the software must have a distributed architecture with the ability to run multiple queries across multiple systems—and scale while doing it. Hand in hand with this, analytics solutions should support multithreading to fully harness the processing power offered by four-core and eight-core servers to handle large data volumes.

2-Some Analytics Need to Occur at the Edge

In large IoT deployments, there can be billions of events streaming through each second. However, many businesses only need an average over time, or insights into trends that exceed established parameters. For example, a minor subsystem may need to trigger an alarm when the temperature drops below a certain threshold. It doesn’t need to know the temperature at every single second of the day; all it needs to know is when this particular incident occurs.

The answer is to conduct some analytics on IoT devices or gateways at the edge and send aggregated results to the central system. This facilitates the detection of important trends or aberrations, such as temperature changes or failed access attempts. At the same time, it significantly reduces network traffic to improve performance.
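To make the idea concrete, here is a minimal Python sketch of edge-side aggregation. It is an illustration under stated assumptions rather than a reference design: the class name `EdgeAggregator`, the alarm threshold, and the batch size are all hypothetical, and a real gateway would also need to handle transport, retries, and time-based flushing.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class EdgeAggregator:
    """Buffers raw readings on a gateway and forwards only summaries and alarms."""

    low_threshold: float      # hypothetical alarm threshold
    batch_size: int = 60      # hypothetical number of readings per summary
    _buffer: List[float] = field(default_factory=list)

    def ingest(self, reading: float) -> List[dict]:
        """Return the messages (possibly none) to send upstream for this reading."""
        outbound = []
        # An out-of-range reading is reported immediately.
        if reading < self.low_threshold:
            outbound.append({"type": "alarm", "value": reading})
        # Everything else is only summarized once a full batch has accumulated.
        self._buffer.append(reading)
        if len(self._buffer) >= self.batch_size:
            outbound.append({
                "type": "summary",
                "count": len(self._buffer),
                "min": min(self._buffer),
                "max": max(self._buffer),
                "mean": mean(self._buffer),
            })
            self._buffer.clear()
        return outbound


def run_demo() -> List[dict]:
    """Feed five readings through the aggregator and collect what gets sent."""
    aggregator = EdgeAggregator(low_threshold=5.0, batch_size=4)
    sent = []
    for reading in [20.0, 21.0, 4.0, 19.0, 22.0]:
        sent.extend(aggregator.ingest(reading))
    return sent
```

In this demo, five raw readings result in only two upstream messages: one alarm for the reading below the threshold and one summary of the first four readings. That is the traffic reduction described above, in miniature.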

Such edge analysis requires very lightweight software, since IoT nodes and gateways are low-power devices with limited processing capacity available for queries. To address this challenge, more analytics solutions are utilizing Apache Storm, which offers a small footprint suited to the cloud and gateways, for real-time processing. Additionally, several companies are working on edge analytics products and reference architectures. Still, because edge computing is heavily contextual, there is no one-size-fits-all solution.

3-Streaming Analytics Drive Real-Time Insights

Since IoT data is essentially streams of events, it plays a critical role in providing the insights to support real-time interactions, whether triggering a thermostat or a fraud alert. As a result, IoT deployments typically require some form of complex event processing (CEP) and streaming analytics. The software should handle time-series data, time windows, moving averages, and temporal event patterns.
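As a rough illustration of what a time window and a moving average mean in practice, the following Python sketch keeps a running average over the events from the last few seconds. The class name `SlidingWindowAverage` and its parameters are hypothetical, and the sketch assumes events arrive with non-decreasing timestamps, which real IoT streams do not always guarantee.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowAverage:
    """Moving average over events whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        """Add one event (timestamps assumed non-decreasing) and return the current average."""
        self._events.append((timestamp, value))
        self._total += value
        # Evict events that have aged out of the window (timestamp - window, timestamp].
        cutoff = timestamp - self.window_seconds
        while self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


window = SlidingWindowAverage(window_seconds=10)
for ts, temperature in [(0, 10.0), (5, 20.0), (10, 30.0), (12, 40.0)]:
    print(ts, window.add(ts, temperature))
```

Keeping a running total means each event is added once and removed once, so the cost per event stays small even for large windows. Streaming engines and CEP platforms provide windowing like this as a built-in operator; the sketch only shows the underlying idea.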

A number of open source technologies have emerged to address the demands of streaming and real-time analytics. Apache Storm is perhaps the most widely used streaming analytics engine. Meanwhile, Apache Spark and Apache Flink each offer a single programming model for handling both streaming analytics and real-time event processing, as does the cloud-based Google Cloud Dataflow. Additionally, there is a range of CEP platforms that support complex temporal queries. With each offering, there are trade-offs, so an IoT implementation’s specific requirements will determine the technology approach.

Hand-in-hand with streaming analytics, some enterprises are turning to time series databases (TSDBs) for their IoT implementations. The databases require timestamps on all data and are capable of writing data within milliseconds. Examples of TSDBs include OpenTSDB, InfluxDB, and KairosDB, and they are typically used in conjunction with SQL or NoSQL databases.

4-IoT Data Comes With Uncertainty

How inbound IoT data is ordered can be extremely important. For example, a progression of events may indicate that an engine part is heading for failure. At the same time, tremendous numbers of nodes are pushing data through low-bandwidth IoT networks. Sometimes those nodes fail, creating issues about whether sensors should keep data and send it later. Other challenges can include collection latency, duplicate messages, and reliability.

IoT analysis utilizing time windows and temporal sequences will require dedicated rule sets and queries to ensure the proper order of inbound data. However, there is no off-the-shelf commercial solution or open source project that developers can simply apply to their systems. Instead, many IT organizations will need to develop custom rules and queries to support the specific requirements of their IoT analytics implementations. That said, there are proprietary technologies used internally, such as Google MillWheel for fault-tolerant data stream processing, that can provide examples of how to address these demands.
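One common building block for such custom rules is a small buffer that holds events briefly, discards duplicate deliveries, and releases events in timestamp order once they are unlikely to be overtaken by a late arrival. The Python sketch below illustrates the idea only. The class name `ReorderBuffer`, the `allowed_lateness` setting, and the policy of setting aside events that arrive too late are all assumptions; the right choices depend on the deployment.

```python
import heapq
from typing import List, Set, Tuple

Event = Tuple[float, str, float]  # (timestamp, event_id, value)


class ReorderBuffer:
    """Holds events briefly so that late arrivals can still be emitted in timestamp order."""

    def __init__(self, allowed_lateness: float) -> None:
        self.allowed_lateness = allowed_lateness  # hypothetical tuning knob
        self.too_late: List[Event] = []           # events that missed their chance
        self._heap: List[Event] = []
        self._seen_ids: Set[str] = set()          # grows without bound in this sketch
        self._max_timestamp = float("-inf")

    def push(self, timestamp: float, event_id: str, value: float) -> List[Event]:
        """Accept one event and return the events that are now safe to release, in order."""
        if event_id in self._seen_ids:
            return []  # duplicate delivery, ignore it
        self._seen_ids.add(event_id)
        # Events older than the current watermark would break ordering; set them aside.
        if timestamp < self._max_timestamp - self.allowed_lateness:
            self.too_late.append((timestamp, event_id, value))
            return []
        heapq.heappush(self._heap, (timestamp, event_id, value))
        self._max_timestamp = max(self._max_timestamp, timestamp)
        # Release everything at or below the new watermark.
        watermark = self._max_timestamp - self.allowed_lateness
        released = []
        while self._heap and self._heap[0][0] <= watermark:
            released.append(heapq.heappop(self._heap))
        return released

    def flush(self) -> List[Event]:
        """Release all remaining events in timestamp order, for example at shutdown."""
        released = []
        while self._heap:
            released.append(heapq.heappop(self._heap))
        return released
```

For example, with `allowed_lateness=5`, events arriving with timestamps 1, 3, 2 are held back; when an event with timestamp 9 arrives, the first three are released as 1, 2, 3. The trade-off is latency: a larger `allowed_lateness` tolerates slower nodes but delays every downstream result by that amount.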

5-Prediction Boosts the Power of IoT Analytics

Predictive analytics are pervasive on the web, from Google’s search engine to those ubiquitous pop-up ads targeting consumers’ recent searches and purchases. Increasingly, they are playing a central role in maximizing the effectiveness of IoT applications, from fraud detection to proactive engine maintenance, and nearby restaurant recommendations, to name a few.

Traditionally, statistical models have served as the bedrock for predictions. However, machine-learning algorithms are emerging as a strong alternative—sometimes working in conjunction with statistical models and sometimes replacing them altogether. Notably, machine-learning algorithms can handle extremely large volumes of data, and they can automatically learn from the data, unlike rules-only systems that require professionals to watch rules and evaluate their performance.

Several frameworks for machine learning have emerged in recent years. These include Apache Spark MLlib, Dato GraphLab Create, and Skytree. Among these, Spark MLlib, the highly scalable Apache Spark machine-learning library, has the largest community and continues to see rapid adoption. Dato GraphLab Create, which is written in C++, provides fast performance and scalability. Meanwhile, Skytree is recognized for its superior algorithms, but the software is proprietary and therefore requires a higher upfront investment. Two other machine-learning frameworks, Facebook Torch and Google TensorFlow, come from a deep-learning heritage, and, at least today, they seem better suited to web applications than IoT solutions.

These are just a few of the available options. Some software companies are building solutions on top of open source machine-learning algorithms while other organizations are developing new algorithms altogether. More research is needed, but a thorough evaluation of a company’s IoT scenario can help in determining the best alternative.

One last thought:

The market for IoT analytics technologies is still nascent. So adopting a flexible and open architecture for today’s analytics challenges will best position an enterprise to capitalize on emerging technologies in this arena tomorrow.
