Designing a Streaming Architecture for High-Frequency Sensor Data
Time is precious, short, relative and complicated… especially when managing streaming applications, where calculations are performed in near-real-time. Even more challenges arise when data come from sensors and are sampled at different rates and high frequencies. Machine and deep learning algorithms are often incorporated and have different mathematical assumptions when it comes to time.
When building streaming applications, teams can spend significant time backtracking when these assumptions are not met throughout the system. Careful design of the architecture is important, especially with respect to data prep, signal analysis, and machine learning.
This post will discuss common challenges in designing a system for streaming predictions on sensor data… with an eye on the time.
Consider the example of a pump with sensors measuring pressure, volume, and fluid flow in real-time. You can train a machine learning model with the sensor data to predict failures and the remaining useful life of the pump. Since the data are streaming continuously, potential problems can be predicted and displayed in near-real-time — to fix the pump before it fails.
But before diving in, you need to plan the streaming architecture in order to transform the data appropriately for machine learning. For example, the overall pipeline might look something like this:
The sensor data are managed by a messaging service, then passed to a streaming function that processes the signals and makes predictions. The model state is updated, and results are sent to the dashboard.
There may be thousands of sensors, so you need a robust, scalable messaging service capable of handling high-frequency streams, like Apache Kafka. It’s distributed (easily scalable), and you can specify important system constraints like time windowing, handling of out-of-order data, buffering, and more. But to decide on these parameters for the architecture, the rest of the system must be considered.
The data are streamed continuously, at clock-time, but will be passed to the functions/models in “chunks,” based on the time window, and transformed throughout the pipeline. Time is represented in multiple ways (Hz, seconds, samples,…) and it can be helpful to think through the requirements at each step:
- Actual and desired frequency & units
- Resampling methods
- Signal processing and AI algorithms
Basically, you want to think about how much data will be needed for sensible calculations and use this to set the window. When frequency is on the order of 1 Hz, a few seconds of data would be reasonable.
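As a rough illustration of how window length and sampling frequency relate, here is a minimal sketch in plain Python. The function names (samples_per_window, tumbling_windows) and the sample records are hypothetical, and the grouping runs over a finished list rather than a live stream; in a real deployment the messaging or stream-processing layer would normally handle windowing and late records.

```python
from collections import defaultdict


def samples_per_window(sample_rate_hz, window_seconds):
    """Number of samples that a window of the given length holds."""
    return int(round(sample_rate_hz * window_seconds))


def tumbling_windows(records, window_seconds):
    """Group (timestamp_seconds, value) records into fixed, non-overlapping windows.

    Records may arrive out of order; each window is sorted by timestamp
    before it is returned.
    """
    buckets = defaultdict(list)
    for timestamp, value in records:
        window_index = int(timestamp // window_seconds)
        buckets[window_index].append((timestamp, value))
    return [sorted(buckets[index]) for index in sorted(buckets)]


# Hypothetical 1 Hz signal with one late-arriving record (t = 2.0 after t = 3.0).
records = [(0.0, 10.1), (1.0, 10.3), (3.0, 10.2), (2.0, 10.4), (4.0, 10.0), (5.0, 9.9)]
windows = tumbling_windows(records, window_seconds=3.0)

for window in windows:
    print(window)
```

At 1 Hz, a 3-second window holds only 3 samples, while the same window at 1000 Hz would hold 3000, so the window length is best chosen from the number of samples the downstream calculations need.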
Leaving the Time Domain
Because these systems sample at high frequency, it’s common to preprocess the sensor data (resample and smooth, for example), then perform calculations in the time and frequency domains. Most signal processing algorithms assume that the samples are ordered in time (timestamps increase monotonically) and evenly spaced, so this is often the first goal in data preparation. For example, the pump signals (pressure, volume, flow) are first synchronized to a constant time step and resampled via linear interpolation.
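The resampling step might be sketched as follows. This is an illustration only, not the pipeline from the talk: resample_linear is a hypothetical helper written with the standard library so it stays self-contained, and the pressure readings are made up. In practice a library routine such as numpy.interp would typically do the interpolation.

```python
from bisect import bisect_right


def resample_linear(timestamps, values, step):
    """Resample an irregularly sampled signal onto a uniform grid.

    timestamps must be strictly increasing. The grid starts at the first
    timestamp and advances by `step` up to the last timestamp.
    """
    if len(timestamps) != len(values) or len(timestamps) < 2:
        raise ValueError("need at least two (timestamp, value) pairs")
    span = timestamps[-1] - timestamps[0]
    n_steps = int(span / step + 1e-9)  # small tolerance for float rounding
    grid, resampled = [], []
    for k in range(n_steps + 1):
        t = timestamps[0] + k * step
        # Right-hand neighbor, clamped so the final grid point uses the last segment.
        right = min(bisect_right(timestamps, t), len(timestamps) - 1)
        left = right - 1
        fraction = (t - timestamps[left]) / (timestamps[right] - timestamps[left])
        grid.append(t)
        resampled.append(values[left] + fraction * (values[right] - values[left]))
    return grid, resampled


# Hypothetical pressure readings with uneven spacing, resampled to a 1-second step.
pressure_times = [0.0, 0.9, 2.1, 3.0]
pressure_values = [1.0, 1.9, 3.1, 4.0]
grid, pressure_uniform = resample_linear(pressure_times, pressure_values, step=1.0)
print(grid)
print(pressure_uniform)
```

Calling the same function with the same step for volume and flow would put all three signals on a shared time grid, provided they cover the same time span.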
After the initial preprocessing, the next step is to prepare the data for input into machine and deep learning models. The goal is to find the representation of the signal with the most information. This often includes summary statistics like mean, skewness, and kurtosis. In addition, frequency information like the spectral density, Fourier transform, and peak-to-peak distances can characterize signal features beyond the temporal information.
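To make the feature step concrete, here is a small sketch that computes a few of these quantities for one window of uniformly sampled data. The function names and the test signal are hypothetical, the moments are population moments, and the frequency estimate uses a naive O(n²) discrete Fourier transform that is only practical for short windows. A production system would more likely rely on an FFT and a spectral density estimator from a numerical library.

```python
import cmath
import math


def summary_statistics(signal):
    """Mean, skewness, and excess kurtosis from population moments.

    Assumes the signal is not constant (variance must be non-zero).
    """
    n = len(signal)
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    return {
        "mean": mean,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,
    }


def dominant_frequency(signal, sample_rate_hz):
    """Frequency in Hz of the strongest non-DC component of a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(centered)
        )
        power = abs(coefficient) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


# Hypothetical window: one second of a 5 Hz sine sampled at 100 Hz.
sample_rate_hz = 100.0
signal = [math.sin(2 * math.pi * 5.0 * i / sample_rate_hz) for i in range(100)]
features = summary_statistics(signal)
features["dominant_frequency_hz"] = dominant_frequency(signal, sample_rate_hz)
print(features)
```

The resulting dictionary is the kind of fixed-length feature vector that can be passed to a downstream model for each window.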
If you’re less comfortable in the frequency domain, you can explore the spectral information visually with a spectrogram (below). See this example for a practical introduction to time-frequency analysis.
Figure: Spectral Visualizations in MATLAB
The signal processing calculations capture features from the data, which are used for predictive modeling. This could include several types of models like deep networks (LSTMs), system identification, and regression models, which can incorporate time and may also change over time.
Consider an example of predicting the remaining useful life of the pump. Even the unit conversions require some thought: the input data are sampled at rates measured in Hz, while the outputs are in days, months, or years. In addition, the remaining lifetime decreases with usage, so the model must be continuously updated as new data arrive. This requires some planning for the architecture and can be managed by caching the model info (using Redis cache or similar) and updating it with each prediction.
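The caching pattern and the unit conversion can be sketched together. Everything here is an assumption made for illustration: InMemoryCache is a hypothetical stand-in for an external cache such as Redis, update_remaining_life is a made-up helper, and the linear wear model is a placeholder rather than a trained remaining-useful-life model.

```python
SECONDS_PER_DAY = 86_400


class InMemoryCache:
    """Hypothetical stand-in for an external cache such as Redis."""

    def __init__(self):
        self._store = {}

    def get(self, key, default=None):
        return self._store.get(key, default)

    def set(self, key, value):
        self._store[key] = value


def update_remaining_life(cache, pump_id, n_samples, sample_rate_hz, rated_life_days):
    """Add one window of operating time to the cached state.

    Returns the remaining life in days. Placeholder model: life is consumed
    linearly with operating time.
    """
    # Convert a sample count at a known rate (Hz) into elapsed seconds.
    window_seconds = n_samples / sample_rate_hz
    previous = cache.get(pump_id, {"operating_seconds": 0.0})
    state = {"operating_seconds": previous["operating_seconds"] + window_seconds}
    cache.set(pump_id, state)
    used_days = state["operating_seconds"] / SECONDS_PER_DAY
    return max(rated_life_days - used_days, 0.0)


# Two consecutive 2-second windows sampled at 1000 Hz for a hypothetical pump.
cache = InMemoryCache()
first = update_remaining_life(cache, "pump-1", 2000, 1000.0, rated_life_days=365.0)
second = update_remaining_life(cache, "pump-1", 2000, 1000.0, rated_life_days=365.0)
print(first, second)
```

Keeping the state outside the prediction function means any worker can pick up the next window for a given pump, which is the main reason for using an external cache in a distributed setup.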
These are just some of the considerations in designing a streaming architecture for sensor data. My upcoming talk at ODSC West, “Deploying AI for Near Real-Time Engineering Decisions,” will focus on building a system to address these challenges using MATLAB, Python, Apache Kafka, and Microsoft Azure.