Going Real-Time with IoT and Stream Processing: Strategies and Technologies

As the world grows ever more demanding in how it consumes content, data is no exception. The emphasis on real-time, streaming data, as well as IoT, has shaped what many organizations seek in terms of operational intelligence. Due to limited technical capabilities, however, many enterprises are ill-equipped to deliver in the real-time arena.

Experts joined DBTA’s webinar, “From Edge to Cloud: IoT Data Management and Stream Processing,” to explore how new approaches to data management can mean the difference between relying on archaic infrastructures and adopting agile, flexible, and scalable data processes.

Nima Negahban, co-founder and CEO of Kinetica, pointed to the potential value lying in IoT data. The initial phase of developing an IoT infrastructure, which relied on sensors, networking, and device management, derived value from connectedness.

While this was a critical starting point, the next phase of IoT, which leverages data management, ML feature engineering, and ML inference, surfaces value from fusing data streams into actionable intelligence that drives industry transformation.

Similar to the beginnings of IoT, traditional data pipelines delivered value by connecting data sources to data consumers. Utilizing pre-processed data to improve end-user performance, traditional data pipelines automated data collection and provisioning, ultimately offering faster query speeds.

However, these data pipelines were not free of challenges: the time-consuming and labor-intensive nature of their builds, the abundance of stale data, low agility due to denormalization, and a continuous “out of order” state proved that, though data pipelines offered some value, they also produced a myriad of complications.

For these reasons, Negahban argued, traditional data pipelines are dead. This efficiency gap spurred innovation at Kinetica, whose next-generation compute paradigms provide increased end-user performance without the need for tedious data pipelines. Keeping data fast and fresh, accompanied by a full corpus of data at a lower TCO, Kinetica streamlines the consumption of data.

At the core of this innovation is native vectorization, which parallelizes the data within each node. This force multiplier delivers order-of-magnitude performance improvements on a smaller compute footprint while removing the complex data engineering commonly required by other databases to make up for their inefficiencies.
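To illustrate the general idea behind vectorized execution (a NumPy sketch of the concept, not Kinetica's engine), compare a per-row interpreted loop with a single data-parallel expression over the same column:

```python
import numpy as np

# A numeric "column" of sensor readings (illustrative data).
readings = np.array([12.0, 15.5, 9.8, 22.1, 18.3])

# Scalar approach: touch one value at a time in an interpreted loop.
adjusted_loop = []
for r in readings:
    adjusted_loop.append(r * 1.8 + 32.0)  # e.g., Celsius -> Fahrenheit

# Vectorized approach: one expression applied to the whole column,
# letting the runtime exploit SIMD/data-parallel hardware under the hood.
adjusted_vec = readings * 1.8 + 32.0

# Both paths compute the same result; the vectorized one does it
# in a single pass over contiguous memory.
assert np.allclose(adjusted_loop, adjusted_vec)
```

The same principle, applied natively across every node of a distributed database, is what produces the compounding speedups described above.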

George Trujillo, principal data strategist at DataStax, organized his approach to streamlining data management within three areas: change accelerators, real-time technology stack impacts, and the real-time AI vision and execution strategy.

Trujillo dove into change accelerators by offering a few stats; according to Gartner, “Unstructured data represents an astounding 80-90% of all new enterprise data, and it's growing 3X faster than structured data,” and, “By 2027, over 90% of new business software applications will contain ML models.” These figures highlight the ever-growing interest in, and reliance on, real-time technologies to increase the efficacy and impact of data, regardless of its structure.

Real-time data must be partnered with advanced technologies, including sensors, intelligent devices, auto responses and alerts, edge computing, IoT, mobile devices, and more; these technologies, in turn, come with their own set of requirements for maintaining operational efficiency, including scalability and data quality.

The real-time technology stack, Trujillo continued, unifies event streaming data, operational data, and ML features data within a low latency, scalable operational data store. This enables data to flow as a data supply chain in real-time; the real-time pipeline, which starts as data sources (sensors/devices/IoT), moves to data-in-motion (data ingestion platform), then to data-at-rest (operational data and feature store), finally moving to the analytics platform.
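The supply-chain flow Trujillo describes can be sketched as a chain of stages; the stage names below are illustrative, not DataStax APIs:

```python
def sensor_readings():
    # Data sources: sensors/devices/IoT emitting raw events.
    for i, temp in enumerate([21.5, 22.0, 35.9, 22.3]):
        yield {"device": f"sensor-{i % 2}", "temp_c": temp}

def ingest(events):
    # Data-in-motion: the ingestion platform tags each event on arrival.
    for event in events:
        yield {**event, "ingested": True}

def store(events):
    # Data-at-rest: land events in an operational data/feature store.
    return list(events)

def analyze(table):
    # Analytics platform: flag out-of-range readings.
    return [row for row in table if row["temp_c"] > 30.0]

# The pipeline composes end to end: sources -> in-motion -> at-rest -> analytics.
alerts = analyze(store(ingest(sensor_readings())))
print(alerts)
```

The point of the low-latency operational store in the middle is that each downstream stage consumes fresh data rather than a stale batch extract.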

Where the real-time AI mindset enters the stage is through DataStax’s real-time data ecosystem of technologies, according to Trujillo. These solutions ultimately drive self-managed and managed streaming and analytics services that help organizations envision AI enabled by real-time data. DataStax’s streaming services are powered by Apache Pulsar, while the following are powered by Apache Cassandra:

  • Astra DB, the managed, serverless, NoSQL service
  • DataStax Enterprise, the self-managed cloud or on-prem integrated search, analytics, and graph platform
  • K8ssandra, an Apache Cassandra ecosystem acting as a self-managed cloud or on-prem solution, running in any Kubernetes environment
  • Stargate API, the data API gateway

Rachel Pedreschi, head of technical services at Decodable, argued that real-time requirements demand a modern data architecture. As the future of IoT data moves toward real-time streaming, with its extensive list of requirements, data lateness becomes a paramount issue.

Apache Flink, a framework and distributed processing engine for stateful computations over unbounded and bounded data streams, stands out for IoT data processing, according to Pedreschi. With effective time semantics, advanced features for messy data, performance and scalability guarantees, and support for both stateful and stateless processing, Apache Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
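The stateful/stateless distinction Pedreschi highlights can be sketched in plain Python; this is a conceptual illustration, not Flink's actual API:

```python
def stateless_filter(stream, threshold):
    # Stateless: each event is handled on its own, with no memory
    # carried between events.
    for value in stream:
        if value > threshold:
            yield value

def stateful_running_avg(stream):
    # Stateful: the operator keeps a running count and sum across
    # events -- the kind of state a stream processor must checkpoint
    # to survive failures.
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

events = [10.0, 20.0, 30.0]
filtered = list(stateless_filter(events, 15.0))
averages = list(stateful_running_avg(events))
```

Managing that per-operator state correctly across a distributed cluster, with out-of-order events, is precisely the hard part that engines like Flink take off the developer's plate.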

Pedreschi further explained that while there is a plethora of open source projects and cloud services that can ease the pain of data lateness, including MQTT, Kafka, and the aforementioned Apache Flink, leveraging these tools in tandem is far more difficult than it seems.

Decodable innovates on this complex and lengthy process, reducing the build of real-time applications and services from months to minutes. With zero infrastructure, rapid development, a connector catalog, and intuitive abstractions, Decodable offers a real-time stream processing platform built on Apache Flink that requires no cluster setup or code writing.

Decodable allows its users to quickly build processing pipelines using SQL, eliminates the need to learn multiple open source technologies, and supports stateful and stateless streaming pipelines with fully integrated monitoring and alerting. Ultimately, the Decodable platform drives accessibility for real-time data stream processing with fully managed underlying operations and simple development and implementation.

For an in-depth discussion regarding strategies and use cases in achieving real-time data, you can view an archived version of the webinar here.