Analyzing Any Data, Anywhere, All the Time

Have you heard about stream computing? It involves ingesting data - structured or unstructured - from arbitrary sources and processing it without necessarily persisting it. Any digitized data is fair game for stream computing. As the data streams in, it is analyzed and processed in a problem-specific manner. The "sweet spot" applications for stream computing are situations in which devices produce large amounts of instrumentation data on a regular basis. Such data is difficult for humans to interpret and is likely to be too voluminous to store in a database. Domains that produce data well suited to stream computing include healthcare, weather, telephony, stock trading, and so on.
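To make "processing without persisting" concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product works: the names RunningStats and summarize are made up for this example, and the technique shown is Welford's online algorithm, which keeps a running mean and variance in constant memory while each reading is seen once and then discarded.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Constant-memory summary of a stream (Welford's online algorithm)."""
    count: int = 0
    mean: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Fold one reading into the summary; the reading itself is not kept.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one reading has arrived.
        return self._m2 / self.count if self.count else 0.0


def summarize(readings: Iterable[float]) -> RunningStats:
    """Consume any iterable of readings (a list, a generator, a socket reader)."""
    stats = RunningStats()
    for value in readings:
        stats.update(value)
    return stats
```

Because summarize accepts any iterable, the same code works whether the readings come from a short list or from a generator that never ends; memory use stays the same either way.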

By analyzing large streams of data and looking for trends, patterns, and "interesting" data, stream computing can solve problems that were not practical to address using traditional computing methods. Another useful way of thinking about this is as RTAP - real-time analytical processing (as opposed to OLAP - online analytical processing). 

Consider a healthcare example. IBM and the University of Ontario Institute of Technology (UOIT) are using an IBM stream computing product, InfoSphere Streams, to help doctors detect subtle changes in the condition of critically ill premature babies. The software ingests a constant stream of biomedical data, such as heart rate and respiration, along with clinical information about the babies. Monitoring premature babies as a patient group is especially important because certain life-threatening conditions, such as infection, may be detected up to 24 hours in advance by observing changes in physiological data streams. The biomedical data produced by numerous medical instruments cannot be watched constantly by clinical staff, nor can a never-ending stream of values for multiple patients be stored long term.

But the stream of healthcare data can be constantly monitored with a stream computing solution. As a result, many early diagnoses can be made that would otherwise take medical professionals much longer to reach. For example, an unusually regular heartbeat can indicate problems (like infections); a normal heartbeat is more variable. Analyzing an ECG stream can highlight this pattern and alert medical professionals to a problem that might otherwise go undetected for a long period. Detecting the problem early can allow doctors to treat an infection before it causes great harm.
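The heartbeat example can be sketched in a few lines of Python. This is a toy illustration, not the UOIT system and not clinical guidance: the function name low_variability_alerts, the window of five readings, and the 1.0 beats-per-minute threshold are all assumptions chosen for the example. The idea is simply to keep a short sliding window of recent heart-rate readings and raise an alert whenever their spread drops suspiciously low.

```python
from collections import deque
from statistics import pstdev
from typing import Iterable, Iterator, Tuple


def low_variability_alerts(
    heart_rates: Iterable[float],
    window: int = 5,         # assumed window length, in readings
    min_stdev: float = 1.0,  # assumed threshold, in beats per minute
) -> Iterator[Tuple[int, float]]:
    """Yield (index, stdev) whenever the last `window` readings vary too little."""
    recent = deque(maxlen=window)  # older readings fall off automatically
    for index, bpm in enumerate(heart_rates):
        recent.append(bpm)
        if len(recent) == window:
            spread = pstdev(recent)
            if spread < min_stdev:
                yield index, spread
```

Only the most recent readings are ever held in memory, so the generator can run against an unbounded feed and emit alerts as they occur.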

A stream computing application can get quite complex. Continuous applications, composed of individual operators, can be interconnected and operate on multiple data streams. Again, think about the healthcare example. There can be multiple streams (blood pressure, heart rate, temperature, etc.) from multiple patients (because infections travel from patient to patient), leading to multiple possible diagnoses.

Consider a second example: law enforcement. A stream computing application can monitor a stream of video data produced by a surveillance camera. Much of the stream will not be interesting. It becomes interesting when a person shows up in the video. The stream computing application can constantly analyze the video stream, performing scene detection and face identification. When something "interesting" is found, that section of video can be captured and retained. The face might even be matched automatically against a database of known criminals.
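The "keep only the interesting part" idea might look something like the following Python sketch. Everything here is hypothetical: interesting_segments is a made-up name, and is_interesting stands in for whatever scene or face detector a real system would use. The sketch only shows the retention logic, which groups consecutive flagged frames into segments and lets everything else go.

```python
from itertools import groupby
from typing import Callable, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def interesting_segments(
    frames: Iterable[Frame],
    is_interesting: Callable[[Frame], bool],  # hypothetical detector
) -> Iterator[List[Frame]]:
    """Yield runs of consecutive frames the detector flags; drop the rest."""
    # groupby is lazy, so frames are pulled from the stream one at a time.
    for flagged, run in groupby(frames, key=is_interesting):
        if flagged:
            yield list(run)  # only flagged runs are materialized and kept
```

For instance, feeding it the frames "empty", "person", "person", "empty", "person" with a detector that flags "person" produces two retained segments, one of two frames and one of a single frame.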

As mentioned earlier, the IBM product for stream computing is called InfoSphere Streams. It runs on xSeries blades (up to 125 x86 blades) using Linux. It is based on three main abstractions: streams - pipes of data that can be subscribed to; operators - processors that perform analytical calculations; and topology - the way streams are connected to operators.
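The three abstractions can be mimicked in plain Python to get a feel for them. To be clear, this is not the InfoSphere Streams API or its programming language; it is a rough analogy with invented names (drop_out_of_range, moving_average, topology). Here a stream is any iterable of numbers, an operator is a function from one stream to another, and a topology is just the order in which operators are wired together.

```python
from collections import deque
from typing import Callable, Iterable

Stream = Iterable[float]
Operator = Callable[[Stream], Stream]


def drop_out_of_range(low: float, high: float) -> Operator:
    """Operator that discards readings outside [low, high]."""
    def operator(stream: Stream) -> Stream:
        return (x for x in stream if low <= x <= high)
    return operator


def moving_average(size: int) -> Operator:
    """Operator that emits the mean of each full window of `size` readings."""
    def operator(stream: Stream) -> Stream:
        window = deque(maxlen=size)
        for x in stream:
            window.append(x)
            if len(window) == size:
                yield sum(window) / size
    return operator


def topology(*operators: Operator) -> Operator:
    """Wire operators in sequence: each consumes the previous one's output."""
    def run(stream: Stream) -> Stream:
        for op in operators:
            stream = op(stream)
        return stream
    return run


# Example wiring: filter obvious sensor glitches, then smooth what remains.
pipeline = topology(drop_out_of_range(0, 250), moving_average(2))
smoothed = list(pipeline([100, 999, 110, 120]))
```

Each operator is lazy, so nothing is computed until a downstream consumer asks for the next value, and no stage ever holds more than its own small window of data.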

The data streams into the system, which is built as a series of cascading steps. Each step progressively refines the analysis, looking for information, patterns, trends, and diagnoses. IBM's stream computing offerings and research are the result of more than 20 years of IBM information management expertise, 5 years of development by IBM Research, and more than 200 patents.

The ability to process millions of data points per second and perform advanced analytics on the data stream can help to usher in a shift in the way we manage and deal with vast amounts of data.

Not all data can be, or even needs to be, persisted in a database. The future is here and it might be time for us to re-think the way we do business ... by joining the stream.