Industry Leader Q&A with MathWorks' Heather Gorr


Since the stream processes only one second of data at a time, it's also important that each window carries as much information, and as little noise, as possible. One common method is to use frequency-domain features such as the FFT and power spectrum, as in this case, which characterize the data by recurrence rather than by time.
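For illustration only, here is a minimal Python sketch (not the MATLAB workflow discussed in the interview) that computes a few frequency-domain features for a single one-second window; the 10 kHz sample rate and the synthetic signal are assumptions.

```python
import numpy as np
from scipy import signal

fs = 10_000                                # assumed sample rate in Hz
t = np.arange(fs) / fs                     # one second of samples
window = np.sin(2 * np.pi * 120 * t) + 0.3 * np.random.default_rng(0).normal(size=fs)

# Power spectral density via Welch's method, which averages out noise.
freqs, psd = signal.welch(window, fs=fs, nperseg=1024)

# Summarize the one-second window with a few compact frequency-domain features.
features = {
    "peak_freq_hz": float(freqs[np.argmax(psd)]),                 # dominant recurrence
    "total_power": float(np.sum(psd) * (freqs[1] - freqs[0])),    # band-power estimate
    "spectral_centroid": float(np.sum(freqs * psd) / np.sum(psd)),
}
print(features)
```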

What are the cultural and technological issues?

HG: Plenty of resources exist for comparing algorithms, so the focus here is on how streaming affects the choice of model. Models suited to forecasting and time series include the following (a minimal fitting sketch for one of them follows the list):

  • Traditional time-series models (curve fitting, ARIMA, GARCH)
  • Machine learning models (nonlinear: trees, SVMs, Gaussian processes)
  • Deep learning models (multilayer perceptron, CNNs, LSTMs, TCNs)
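As a small, hedged illustration of the first family above, this sketch fits an ARIMA model with statsmodels; the synthetic series and the (2, 1, 2) order are assumptions for the example, not settings from the fault-detection case.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
history = np.cumsum(rng.normal(size=500))    # stand-in for a recorded sensor trend

# Fit the model on the historical window and forecast the next few samples.
model = ARIMA(history, order=(2, 1, 2)).fit()
print(model.forecast(steps=10))
```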

Any of the above could work with the example, but when preparing models for streaming data there are multiple factors to consider. The data arrives one second at a time, so the chosen algorithm must keep up with that rate while remaining robust to noise. The model must also be able to update itself over time as new data arrives, without being retrained on historical data. Model updates and predictions must be fast and easy to distribute, which can significantly influence the choice of algorithm. It's best to keep things simple when working with streaming data.
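To make the update-without-retraining requirement concrete, here is a hedged Python sketch using scikit-learn's partial_fit interface; the SGDClassifier, feature counts, and synthetic windows are assumptions for illustration, not the model used in the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1, 2, 3])              # e.g., healthy plus three fault types

for step in range(5):                         # one iteration per incoming window
    X_window = rng.normal(size=(50, 8))       # 50 feature vectors of 8 features each
    y_window = rng.integers(0, 4, size=50)    # labels for this window only
    # Incremental update: only the newest window is needed, never the full history.
    model.partial_fit(X_window, y_window, classes=classes if step == 0 else None)

print(model.predict(rng.normal(size=(3, 8))))
```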

The multi-class fault detection example prioritizes getting the streaming prototype into production, meaning engineers have to choose and train a model quickly. A classification app is used to evaluate candidate models, and a network editing app to export the most accurate one. A classification tree ensemble is chosen to predict faults, and a regression model to estimate the remaining lifetime.
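A rough Python analogue of that pairing (not the actual apps or trained models from the example) might look like the sketch below; the feature matrix, labels, and hyperparameters are all placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 8))                # frequency-domain features per window
fault_class = rng.integers(0, 4, size=1000)   # 0 = healthy, 1-3 = fault types
rul_hours = rng.uniform(0, 500, size=1000)    # remaining-lifetime labels (hours)

# One ensemble of classification trees for the fault class,
# one regression model for the remaining lifetime.
fault_model = RandomForestClassifier(n_estimators=100).fit(X, fault_class)
rul_model = RandomForestRegressor(n_estimators=100).fit(X, rul_hours)

new_window = rng.normal(size=(1, 8))
print(fault_model.predict(new_window), rul_model.predict(new_window))
```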

With the model trained and validated, engineers can finally start integrating the streaming data. Each step (data preparation, model prediction, and model update) is performed by a function that accepts the data window and current model as inputs and returns the predictions and updated model as outputs. With this signature in place, engineers can easily cache the model in memory to support rapid updates while minimizing network latency.
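Under that description, a hedged sketch of the signature might look like the following; the extract_features helper, the labels argument, and the partial_fit call are assumptions standing in for whatever the actual function does.

```python
import numpy as np

def extract_features(window):
    """Hypothetical data-preparation step: summarize one raw one-second window."""
    return np.array([[window.mean(), window.std(), np.abs(np.fft.rfft(window)).max()]])

def process_window(window, model, labels=None):
    """One call per window: prepare data, predict, update, and return the model."""
    features = extract_features(window)
    predictions = model.predict(features)
    if labels is not None:                    # update only when feedback is available
        model.partial_fit(features, labels)   # assumes an incremental-update API
    return predictions, model

# The caller keeps the returned model cached in memory and passes it back in
# with the next one-second window, so no historical data has to be reloaded.
```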

What is most important in these processes?

HG: To put it all together: planning is crucial when working with high-frequency streaming data. It helps to capture requirements for data types, the time window, and the other factors that arise throughout the streaming process, and to communicate them throughout development. Standard software practices such as documentation, source control, and unit testing help facilitate development as well.

It's also important to smooth code handoffs between teammates. Engineers may share their data preparation and modeling stages with a system architect, for example, who does not have the same familiarity with the model. In the multi-class fault detection example, engineers used their code and model to create a library, which captured dependencies, and wrote a readme file to facilitate the integration steps. They also took advantage of the testing environment, running their code through a local host inside the live streaming architecture, which helped with debugging.

What is the payoff for organizations that can successfully create AI models capable of analyzing high-frequency streaming applications?

HG: Data scientists must always remember to consider every system requirement before starting the data preparation and algorithm development stages, since those requirements, along with the need to update the model over time, influence the choice of algorithm. Fortunately, and to the great benefit of employers and collaborators, many useful methods are available. While it can be challenging to build AI models capable of analyzing high-frequency streaming applications, the sheer amount of data available for processing makes the endeavor more than worth it.

