Real time clustering and filtering with RTMAC and RTEFC
This page examines clustering and filtering of data in real time, a form of multivariate time series analysis. It is a part of BDAC - Big Data Approximating Control
Analyzing real time data
Using real time data poses many challenges. Filtering is needed to reduce noise. Clustering for real time should compress the massive and ever-growing quantity of data, recognize operating condition changes, remember unusual conditions, and further filter noise by averaging similar conditions at different times. Clustering should also systematically adapt to equipment degradation/renewal over time when desired. Knowledge of causality should be exploited to further filter out the effects of unmeasured disturbances and process noise. RTEFC ("Real Time Exponential Filter Clustering") and RTMAC ("Real Time Moving Average Clustering") are efficient and easily implemented new real time clustering algorithms that address these issues when prototype clustering is applicable. These algorithms create and update clustered data sets for data-driven applications, useful either for building models or for direct pattern matching without models, as needed in BDAC.
The focus of this section is on analyzing data collected over time. Data sources include sensors that monitor a system, values of actuators (control system outputs), or other calculated or sampled signals. For real-time systems, the sampling never stops, and new analysis may be needed immediately at each time step. This data is time series data. The time series are multivariate because almost all systems of interest have multiple variables. Even a single-input, single-output control system has time series for the input and output values.
Dynamic systems are addressed by incorporating the time trajectories of all variables within a sliding time window into a single vector at each time step. This is explained in more detail in the section on BDAC data representation.
Some applications involve systems that could be modeled by mathematical equations. These algorithms can be used for static systems or dynamic systems. They also can be used whether or not the systems are time-invariant. A static system, for example, might be represented as a set of algebraic equations. If all parameters in those equations remain fixed over time, the system is time-invariant. If some parameters can change over time, the system is time-varying. Dynamic systems include those modeled by differential equations, or the equivalent versions written in discrete time form. Again, these can be time-invariant or not, depending on whether or not all parameters remain constant over time.
The system does not need to have a mathematical model; it just needs to have associated time series data. So, for instance, traffic speed data for road segments over time can be represented. So could video signals: clusters could be found within the frame for each time step, and then modified in the next frame because of movement or lighting changes.
Overview of clustering and prototype clustering
Suppose we are looking at a system with n variables, where each of those n variables can take different values at different times. We think of the n values at a given time as a “case”: a single data point (vector) in an n-dimensional space. When a large amount of this multivariable data must be analyzed, it is often desirable to group “similar” data points into groups called clusters. “Similar” generally means “close”, using some definition of distance such as a weighted Euclidean distance (square root of the weighted sum of squares of the differences along each axis). The process of clustering is common in “big data” analysis as a way to organize and understand systems, efficiently find nearest neighbors after analysis, or compress data.
The centroid of a cluster (vector average of the cluster members, also called the mean) is often meaningful as a “prototype”, representative of the cluster. Clustering emphasizing centroids is called “prototype clustering”.
A good overview of clustering can be found in chapter 8 of the book Introduction to Data Mining by Tan, Steinbach, and Kumar.
The clustering methods here were developed as a way to create and maintain a data set for use in the model-free BDAC approach to control, introduced in the paper Big Data Approximating Control (BDAC) - A new model-free estimation and control paradigm based on pattern matching and approximation. BDAC uses the centroids as an efficient way to represent possible system behavior, as a way to filter out the effects of noise and unmeasured disturbances, and also to track process changes. However, models developed for process control methods like MPC (Model Predictive Control) or other applications are also subject to the same issues of noise, unmeasured disturbances, and selection of the best data for use in modeling to incorporate unusual conditions as well as normal conditions. These techniques are applicable for pretreating the data for model identification, as well as preparing and maintaining a data set for subsequent neural network training. More generally, they should be suitable for data-driven real-time applications that can use prototype clustering, where centroids are the main interest.
K-means clustering and its limitations for real time use
The earliest method for prototype clustering is K-means clustering. It is still popular because of its simplicity. Given a set of data points (cases), the algorithm starts by choosing K of those cases as cluster centers (centroids). The choice of the number K and the initial choice of the K cases is arbitrary, although there has been a lot of research on choosing K and the initial K cases. Then the algorithm iterates as follows:
 Assign each case to the closest cluster center (centroid)
 Recalculate the centroid of each cluster
 Reset the cluster centers to the new centroids
 Repeat until the centroids stop changing (equivalently, until each case’s cluster assignment is stable)
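The loop above can be sketched in a few lines. This is a minimal illustration in plain NumPy, not any standard library implementation; the initialization (first K cases) and the convergence tolerance are arbitrary choices, as the text notes.

```python
import numpy as np

def kmeans(cases, k, max_iters=100, tol=1e-9):
    """Basic K-means: cases is an (m, n) array of m cases in n dimensions."""
    # Arbitrary initialization: the first k cases become the initial centroids.
    centroids = cases[:k].astype(float).copy()
    labels = np.zeros(len(cases), dtype=int)
    for _ in range(max_iters):
        # Assign each case to the closest cluster center (Euclidean distance).
        dists = np.linalg.norm(cases[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate the centroid of each cluster (keep old centroid if empty).
        new_centroids = np.array([cases[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Repeat until the centroids stop changing.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

Note the unpredictable iteration count: the loop runs until convergence (or the max_iters safety cap), which is one of the real time drawbacks discussed below.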
This approach has major drawbacks for real time use, where new data is always arriving and the process behavior may change over time. The “best” K is unknown, and can grow in many applications. In many process control applications, new operating modes appear after initial implementation, and the process changes over time due to feed or product changes, equipment degradation and renewal, and so on. A related problem is that the core algorithm doesn’t specify how to recognize and add new clusters. Another problem is that each new case causes an unpredictable number of iterations. This can be a problem in “hard real time” systems, where computation must be guaranteed to complete in a fixed time. But the biggest problem is much more basic: there is no mechanism to forget old data. The data set will continue to grow forever, impacting memory usage and computing time. Furthermore, some of that data will become outdated, and should not be weighted equally with more current data.
Real time clustering with RTMAC (Real Time Moving Average Clustering)
RTMAC (Real Time Moving Average Clustering), sometimes shortened to just MAC, is a variation on K-means clustering. Starting from K-means, set a limit nMax on the maximum number of cases per cluster. Then, with each new case:
 Find the cluster with the closest centroid
 Include the new case in that cluster
 If that cluster already has the maximum size nMax, then delete the oldest member of that cluster
 Calculate new centroid values and optionally iterate as in K-means
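One update step of this procedure might look like the following sketch. It assumes at least one cluster already exists and omits the optional K-means iteration; the parameter name nMax follows the text, while the function and variable names are illustrative.

```python
from collections import deque

import numpy as np

def rtmac_update(clusters, new_case, n_max):
    """One RTMAC step. `clusters` is a list of deques, each holding the
    member cases of one cluster, oldest first. Returns the index of the
    cluster that absorbed the new case."""
    # Find the cluster with the closest centroid.
    centroids = [np.mean(c, axis=0) for c in clusters]
    i = int(np.argmin([np.linalg.norm(new_case - m) for m in centroids]))
    # If that cluster is already at nMax members, delete its oldest member...
    if len(clusters[i]) >= n_max:
        clusters[i].popleft()
    # ...then include the new case, so the centroid acts as a moving average.
    clusters[i].append(np.asarray(new_case, dtype=float))
    return i
```

Only the one closest cluster changes per step, which is what makes the centroid behave like a moving average filter of length nMax, as discussed next.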
A variation on this algorithm is to specify a maximum cluster radius so that new clusters can be started when needed to accommodate newly discovered system behavior. This also allows the clustering to start with no initial cases, adding new cluster centers as needed. (A parameter nClustersMax to limit the total number of clusters is also useful.) This approach is demonstrated in the section on RTEFC below.
RTMAC is analogous to moving average filters of length nMax in time series analysis. The initial new centroid immediately after adding a new case is in effect the output of that cluster’s moving average filters: the newest case is added, the oldest case is dropped, and the moving average for each dimension is recomputed. For each new case, only one centroid is updated (before the K-means iteration). This suggests that other filters from time series analysis should be considered, especially those requiring less computational effort and data storage. This leads to the development of RTEFC, which also eliminates the K-means iteration.
Real time clustering with RTEFC (Real Time Exponential Filter Clustering)
RTEFC (Real Time Exponential Filter Clustering), sometimes shortened to just EFC, replaces the moving average calculation with an exponential filter. This is applicable in the important special case where we care about using the cluster centers (centroids), rather than the original data. For instance, in BDAC, the centroids are used as approximate representatives of feasible system trajectories.
The cluster centers (centroids) are stored as the rows {s_{i}} of a matrix S. We define a new case s_{new} as “novel” if its distance to every existing case exceeds a threshold radius R. That is, d_{close} = || s_{new} - s_{close} || > R, where s_{close} is the closest case to s_{new}.
The RTEFC algorithm starts with an empty matrix S. At each time step, given a new case s_{new},
 Calculate distances from s_{new} to rows of S, finding closest match s_{close} and its distance d_{close}
 If s_{new} is novel and dim(S) < nCasesMax
 then add s_{new} to S
 else update s_{close} to a s_{close} + (1-a) s_{new} (exponential filtering)
where dim(S) is the number of rows of S (number of cases) and a is the exponential filter constant. The parameter nCasesMax is a limit on the total number of cases allowed (which is the same as the number of clusters). Optionally, we remove linearly dependent cases for subsequent use of centroids with linear methods.
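A minimal sketch of this update follows. The parameter names (radius, nCasesMax, a) follow the text; for simplicity, S is stored as a list of centroid vectors rather than a matrix, and the optional removal of linearly dependent cases is omitted.

```python
import numpy as np

def rtefc_update(S, s_new, radius, n_cases_max, a):
    """One RTEFC step. S is a list of centroid vectors (the rows of the
    matrix S in the text). Returns the updated list."""
    s_new = np.asarray(s_new, dtype=float)
    # Starting from an empty S, the first case simply becomes the first cluster.
    if not S:
        return [s_new]
    # Calculate distances from s_new to every stored centroid.
    d = [np.linalg.norm(s_new - s) for s in S]
    i = int(np.argmin(d))
    if d[i] > radius and len(S) < n_cases_max:
        # Novel case with room left: start a new cluster.
        S.append(s_new)
    else:
        # Otherwise exponentially filter only the closest centroid.
        S[i] = a * S[i] + (1 - a) * s_new
    return S
```

Note that every step touches exactly one centroid: either a new row is added, or the single closest row is filtered. All other clusters are left untouched, which is how unusual conditions are remembered.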
RTEFC operation is shown in the following diagram. (Click for a full-sized image.)
Understanding the exponential filtering behavior in RTEFC
The filtered output s[k] of a (vector) exponential filter over time steps k is
s[k] = a s[k-1] + (1-a) s_{new}[k]
For a fixed time step T between samples, this filter is equivalent to a first order lag with time constant τ:
a = exp(-T/τ) or equivalently τ = -T/log(a)
The time constant τ is the time it takes to achieve 63.21% of the final value in response to a step change in the input. 98.17% of the response is complete in 4τ. For example, for a sample time of T = 1 minute, a = 0.9 corresponds to τ = 9.49 minutes. For a = 0.998, the corresponding τ = 499.5 minutes. The step response of a single-variable exponential filter follows (click for a full-sized image):
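The conversion between the filter constant a and the time constant τ is easy to check numerically (a small sketch; the function names are illustrative):

```python
import math

def filter_constant(tau, T=1.0):
    """Exponential filter constant a for time constant tau and sample time T."""
    return math.exp(-T / tau)

def time_constant(a, T=1.0):
    """Time constant tau implied by filter constant a and sample time T."""
    return -T / math.log(a)

# For a sample time of T = 1 minute:
print(round(time_constant(0.9), 2))    # a = 0.9   -> tau of about 9.49 minutes
print(round(time_constant(0.998), 1))  # a = 0.998 -> tau of about 499.5 minutes
```

This makes the filter tuning intuitive: pick a desired response time τ in engineering units, then compute a, rather than guessing a directly.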
But RTEFC is not just simple exponential filtering. One obvious difference is that RTEFC is multivariable (although it shares a single filter parameter a across the n filters for an entire vector of n values). But more significantly, at each time step, only one cluster is updated. The rest of the clusters remain unchanged. This makes RTEFC efficient, and also eliminates the unpredictable amounts of iteration found in K-means clustering. This approach also allows unusual operating conditions to be remembered, because clusters representing unusual conditions are not updated during more normal operations. This contrasts with other techniques such as recursive least squares approaches to building models. Those recursive approaches uniformly forget all old data exponentially. For typical process control applications, there are nonlinearities and a variety of different operating conditions, so simple recursive approaches are poorly suited compared to clustering approaches that can remember the unusual conditions.
So, are RTMAC and RTEFC examples of filtering or clustering? Yes!
An example of RTEFC building training cases for data-driven applications
The following plot shows an example of RTEFC operation over time. The vertical axis shows the identity of the closest cluster number to the new case at each time step. The cluster number ids start at 0. The blue dots identify the closest case, while the green dots across the bottom of the plot highlight when a novel case is found (so that a new cluster is started), and some other special conditions. This particular application (BDAC) has individual cases representing entire trajectories of multiple variables over a time window. It ignores some data for reasons such as being at steady state, or not having run long enough to collect data for an entire time window. In those cases, the cluster id is shown as less than zero. Those application-specific details of BDAC don’t matter for just discussing RTEFC. (Click on the plot for a full-sized image.)
Once the first cluster (numbered 0) is started, the next few cases are close enough that they are absorbed in that first cluster. After several time periods, eventually a new case is different enough that a new cluster is started. Several subsequent cases are absorbed into that cluster. This continues for a while, but then due to some cyclic behavior, cases start to repeat around time 300. Those cases are absorbed into earlier clusters. Around time 700, a different set of operating conditions leads to the creation of new clusters. Over time, fewer and fewer new clusters are formed. Instead, the effects of noise and unmeasured variables are filtered out.
Clustering for causal systems
For causal systems, process input changes subsequently cause process output changes. Analyzing the relationship between the inputs and outputs is a key step in numerous fields, including process control. Ideally, for a given set of process inputs, the outputs would always be the same. However, that is not the case, because of:
 Noise or errors in input and output measurements
 Unmeasured disturbances
 Process noise
 Process changes over time
We often want to cluster just based on process inputs, to filter out the effects of noise and unmeasured disturbances, and also to adapt to process changes over time. The solution is to set the weights of the process outputs to 0 in any norm or distance calculation.
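One way to realize this is a weighted distance in which the output dimensions get zero weight. In this sketch, the split of a case vector into input and output dimensions is hypothetical, purely for illustration:

```python
import numpy as np

def weighted_distance(x, y, weights):
    """Weighted Euclidean distance; dimensions with zero weight are ignored."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum(np.asarray(weights) * d * d)))

# Two cases with identical inputs (first two dims) but noisy outputs (last two):
case_a = [1.0, 2.0, 3.0, 4.0]
case_b = [1.0, 2.0, 3.7, 4.5]
w_inputs_only = [1.0, 1.0, 0.0, 0.0]  # output weights set to 0 for clustering
print(weighted_distance(case_a, case_b, w_inputs_only))  # 0.0: same inputs
```

With the output weights zeroed, the two cases land in the same cluster, so the differing (noisy) outputs are averaged away by the centroid update rather than forcing separate clusters.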
Copyright 2017, Greg Stanley
(Return to BDAC - Big Data Approximating Control)
