There’s no question that investing in data systems and infrastructure makes organizations more competitive and enables new, exciting innovations; in that sense, every company is a data company. Recently, though, that maxim has come into sharper focus: the big competitive advantage comes not from data at rest but from streaming data.
Think of Facebook delivering relevant ads in real time based on a user’s clickstream, or of companies making real-time pricing adjustments (for example, surge pricing at Uber or Lyft) based on demand and other factors. These are just two examples of how companies can tap into the firehose of real-time data they generate to create transformative services, products, and customer experiences. Today, streaming data, whether from IoT devices, financial trades, clickstreams, or countless other sources, plays a role in almost everything innovative companies do.
Streaming data presents a tremendous opportunity, but it also creates a host of new challenges. First, ingesting the high volume of data, distilling it down into useful parts, and routing it to a new breed of applications require new and specialized designs.
Second, simply capturing data at an increasing rate provides little to no advantage; enterprises must also process the data. Structured Query Language, or SQL, has been the de facto standard for querying data for decades, but applying it to real-time streams is necessarily different: to run SQL on a boundless stream of data, the continuous SQL grammar introduces some important changes.
Capturing and Storing Events
There are several components to any event streaming architecture. The first consideration is a system for input/output and storage of streams of data. Apache Kafka, Pravega, Pulsar, and NATS are relatively simple, scalable, and high-performance systems for this purpose. They serve as a way to acquire events from sources such as IoT sensors or financial transaction feeds, and they offer durable storage and simple APIs. They have become the new enterprise data bus.
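To make this concrete, here is a minimal sketch, assuming Apache Flink’s SQL dialect (one option among several) and a hypothetical vehicle_positions topic, of how a stream stored in Kafka can be declared as a table; the Continuous SQL examples later in this section query this hypothetical stream:

    -- Hypothetical stream of vehicle position events stored in Kafka.
    CREATE TABLE vehicle_positions (
        vehicle_id STRING,
        lat        DOUBLE,
        lon        DOUBLE,
        speed_kph  DOUBLE,
        event_time TIMESTAMP(3),
        -- Tolerate events that arrive up to five seconds late.
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'vehicle_positions',
        'properties.bootstrap.servers' = 'broker:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    );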
Unfortunately, these systems are only part of a robust event streaming architecture. Getting the data out of them and delivering it to data-hungry applications is problematic: although Kafka, for example, has a robust set of APIs, low-level programming and specialized knowledge of the tricky semantics of streaming data are still required.
Processing Streams
The next ingredient in the architecture is the stream processor: the ability to compute on data directly as it is produced or received. Upon receiving an event from the stream, a stream processing application reacts to that event, for example by triggering another action.
Stream processing frameworks present an API for writing computations against the rapid flow of data. Frameworks such as Apache Flink, Samza, and Storm manage state, handle inter-process communication, provide high availability and restartability, and enable groups of jobs to scale. They are highly powerful but also quite complex, both in initial development and in ongoing maintenance.
Adopting Continuous SQL
As a starting point, SQL is well suited for stream processing. It is a declarative and expressive language that lets sophisticated real-time analytics be expressed as simple queries. It is pervasive, too: most enterprise developers know and understand SQL, and, for those who don’t, it’s easy to learn. SQL also simplifies real-time analytics by encapsulating the underlying complexity of the data management operations. Developers, data engineers, data scientists, and others do not need complicated, low-level APIs to create processors; they can create them using SQL. More importantly, they can issue SQL against the stream of data and receive feedback right away, which lets them explore and reason about the data stream itself using a familiar paradigm.
With traditional SQL, queries are run and then return a result set. With continuous SQL, data flows through a perpetually running query. Each time new data arrives, the query incrementally evaluates its result, efficiently recomputing the current state.
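For instance, here is a sketch, again assuming the Flink SQL dialect and the hypothetical vehicle_positions stream declared earlier, of a query that never completes; each arriving event updates the count for its vehicle rather than triggering a rerun of the whole query:

    -- A continuous aggregate: there is no final result set.
    -- Each arriving event updates the count for its vehicle_id.
    SELECT vehicle_id, COUNT(*) AS positions_seen
    FROM vehicle_positions
    GROUP BY vehicle_id;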
Continuous SQL, sometimes called Streaming SQL, StreamSQL, Continuous Query, Real-Time SQL, or even “Push SQL,” builds upon SQL for stream processing. Continuous SQL should be familiar to anyone who has used SQL with an RDBMS, but it does have some important differences. In an RDBMS, SQL is interpreted and validated, an execution plan is created, a cursor is spawned, and results are gathered into that cursor and then iterated over for a point-in-time picture of the data. This picture is a result set, and it has a start and an end. The phasing is described as parse, execute, and fetch.
In contrast, Continuous SQL queries ceaselessly deliver results to a “sink” of some type. The SQL statement is interpreted and validated against a schema, the statement is executed, and results matching the criteria are continuously returned. Jobs defined this way look very similar to regular stream processing jobs; the difference is that they are created using SQL rather than a language such as Java, Scala, or Python. Data emitted via Continuous SQL arrives as a continuous result: there is a beginning, but no end.
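As a sketch of such a job, again in the Flink SQL dialect, a single statement can read from the hypothetical vehicle_positions stream and continuously write matching rows to a sink (the speeding_vehicles sink table is hypothetical and would be declared with its own CREATE TABLE):

    -- Runs until cancelled; every matching event flows to the sink.
    INSERT INTO speeding_vehicles
    SELECT vehicle_id, lat, lon, speed_kph, event_time
    FROM vehicle_positions
    WHERE speed_kph > 120;

This one statement stands in for what would otherwise be a small stream processing program written in Java, Scala, or Python.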
Running SQL against boundless streams of data requires a new way of thinking about programming. While much of the SQL grammar remains the same, how the queries work, the results they produce, and the overall processing paradigm all differ from traditional RDBMSs. Filters, subqueries, joins, functions, literals, and myriad other useful language features generally exist, but they may behave differently. New constructs around time, specifically “windowing,” are introduced. Mastering these constructs enables platforms that put streaming data within reach of a much wider range of users. For example, SQL enables organizations to offer self-serve platforms such as Netflix Keystone and Uber AthenaX, which allow business users to take advantage of streaming data.
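As an illustration of windowing, here is a sketch, in the Flink SQL dialect and against the hypothetical vehicle_positions stream, that computes a per-vehicle average speed over one-minute tumbling windows:

    -- TUMBLE slices the unbounded stream into fixed one-minute windows;
    -- one row is emitted per vehicle per window as event time advances.
    SELECT
        vehicle_id,
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        AVG(speed_kph) AS avg_speed_kph
    FROM vehicle_positions
    GROUP BY
        vehicle_id,
        TUMBLE(event_time, INTERVAL '1' MINUTE);

Without a window, an aggregate over an unbounded stream would never be “finished”; the window bounds the computation in time and gives each result a well-defined scope.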
Materializing Results
Another key difference between Continuous SQL and traditional SQL is the way the results of a query are handled. In traditional SQL, a result set is returned to the calling application as a cursor consumes the results. With Continuous SQL, as mentioned earlier, results are continuously delivered to sinks. Sinks can be streams of data, such as Kafka topics, or more traditional systems, such as an RDBMS. More than one stream can write to the same sink, and streams are sometimes joined with other streams along the way.
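For example, here is a sketch, in the Flink SQL dialect, in which two streams are joined within a time bound and the combined rows are written to a shared sink (the vehicle_alerts stream and the enriched_positions sink are hypothetical, as before):

    -- An interval join: match each alert with positions observed
    -- within ten seconds of it, then write the result to a common sink.
    INSERT INTO enriched_positions
    SELECT p.vehicle_id, p.lat, p.lon, a.alert_type, p.event_time
    FROM vehicle_positions p
    JOIN vehicle_alerts a
      ON p.vehicle_id = a.vehicle_id
     AND p.event_time BETWEEN a.event_time - INTERVAL '10' SECOND
                          AND a.event_time + INTERVAL '10' SECOND;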
It’s extremely common to have some sort of persistent storage at the end of a stream of data, because it makes the data easy for applications to consume. Perhaps a JavaScript application plots vehicle locations on a map using streamed geolocations; in this scenario, it pulls the locations over HTTP at periodic intervals. Ultimately, creating a materialized view on a stream (materializing the results of a query) is one of the trickiest and most often overlooked parts of a streaming data pipeline, but it is absolutely critical to success.
For example, IoT sensors continuously transmit updated event data for the same parameter (for instance, lat/long). A materialized view over that stream holds only the latest lat/long message for each sensor, not the coordinates of previous locations, giving companies a current snapshot of their data.
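As a sketch of this pattern in the Flink SQL dialect, continuing the hypothetical vehicle_positions stream, the latest position per vehicle can be expressed with a common deduplication query; routing its output to an upsert-capable sink, such as an RDBMS table, materializes the view:

    -- Keep only the most recent position per vehicle. As new events
    -- arrive, the single row for that vehicle_id is replaced.
    SELECT vehicle_id, lat, lon, event_time
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY vehicle_id
                   ORDER BY event_time DESC) AS row_num
        FROM vehicle_positions
    )
    WHERE row_num = 1;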
The Payoffs Are Worth the Effort
Although the technological challenges of harnessing streaming data are daunting, the results are worth the effort. Event-centric thinking unlocks a host of new capabilities and applications that were previously not possible.
Much of what goes on in the business world today is driven by next-generation applications that tap into everything from IoT data and credit card transactions to data about network security threats and online activity. Incorporating streaming data into products and services is the future of product innovation. Solutions based on Continuous SQL let organizations build these next-generation applications around any type of streaming data, creating huge competitive advantages across a wide variety of industries.