This page examines clustering with an emphasis on prototype clustering methods like K-means. It is a part of
BDAC - Big Data Approximating Control
Overview of clustering and prototype clustering
Clustering is common in “big data” analysis as a way to organize and understand systems, efficiently find nearest neighbors after analysis, compress data, or act as a basis for fault detection and isolation. Clustering is purely “data-driven”. The system does not need to have a mathematical model - it just needs to have associated data. In many cases, it may be theoretically possible to derive models. However, modeling might not be done because is expensive and time consuming to identify or derive the model and all the parameters in the model. Examples of this include heating and cooling in buildings, and many complex industrial processes such as distillation and chemical reactors. In other cases, there may be little or no basis for modeling by mathematical equations.
Consider a system with n variables, where each of those n variables can take different values at different times. We think of the n values at a given time as a “case”: a single data point (vector) in an n-dimensional space. When a large amount of this multivariable data must be analyzed, it is often desirable to group “similar” data points into groups called clusters. “Similar” generally means “close”, using some definition of distance such as a weighted Euclidean distance (square root of the weighted sum of squares of the difference in distances along each axis). An example of a cluster analysis for points in 2 dimensions is shown at the left, taken from the Wikipedia article on cluster analysis. That analysis found 3 clusters, indicated by three different colors for the data points represented as squares.
The centroid of a cluster (vector average of the cluster members, also called the mean) is often meaningful as a “prototype”, representative of the cluster. Clustering emphasizing centroids is called “prototype clustering”.
A good overview of clustering can be found in chapter 8 of the book Introduction to Data Mining by Tan, Steinback, and Kumar.
The earliest method for prototype clustering is K-means clustering. It is still popular because of its simplicity. Given a set of data points (vectors of feature values, also called cases here), the algorithm starts, given K initial cluster centers (centroids). The choice of the number K and the initial choice of the K cluster centers is arbitrary, although there has been a lot of research on choosing K and the initial K cases. Then the algorithm iterates as follows: To add each new case:
- Assign each case to the closest cluster center (centroid)
- Recalculate the centroid of each cluster
- Reset the cluster centers to the new centroids
- Repeat until the centroids stop changing/each case’s cluster assignment is stable
Any distance measure can be used. The clusters are spherical when using a distance with equal weighting for each vector element, such as a simple Euclidean distance. The clusters are ellipsoidal when using weighting factors such as a weighted Euclidean distance.
K-means some drawbacks. The “best” K is unknown, and could change over time in many applications. It may also be difficult to choose an initial set of centroids, and the results are sensitive to this initial choice. Clustering can be represented as an attempt at a solution to an optimization problem, but K-means does not guarantee global convergence: the results depend on the initial choice of centroids. Another issue is that adding a new case causes an unpredictable number of iterations. Another issue is that K-means assumes that all clusters are the same size. This may lead to requiring a large number of clusters in some applications. The clusters are also convex, which may lead to requiring a large number of clusters in some problems such as classification for fault detection and isolation.
Supervised vs. unsupervised learning for use in classification
K-means and other clustering methods develop clusters without human intervention: they are “unsupervised”. Each cluster could be automatically labeled by default as “cluster 1”, “cluster 2”, and so on. For any new observation, a classification task answers the question “what class is this in?” A class is a set of conditions that might correspond to a particular operating mode or a fault, for instance. A class includes one or more clusters. One approach to classification is to pick the cluster with the closest centroid, assuming that centroid has been assigned to a class. Classification typically involves “supervised learning’, where humans assign labels to clusters.
Clustering can be converted to at least partly supervised form by having humans optionally label at least some the clusters when possible. For instance, for a fault detection application, a person adds a label “abnormal” to at least the clusters for conditions considered abnormal. For a full diagnosis (fault isolation), a more specific name such as “loss of power to unit 3” is needed.
Extensions for dynamic systems and real-time use
Basic clustering algorithms like K-means can be used for static systems or dynamic systems. Dynamic systems can be addressed by incorporating time trajectories of all variables within a sliding time window, into a single vector at each time step. This is explained in more detail in the section on BDAC data representation.
These basic methods apply to time-invariant systems, whether static or dynamic. That means that the possible combinations of variable values do not change over time. If the system behavior changes over time (is not time-invariant), adaptation to new behavior is desired, or analysis or decisions are needed based on a stream of incoming data, then see the section on real time clustering.
Copyright 2017, Greg Stanley
(Return to BDAC - Big Data Approximating Control )