Exploration and Performance Analysis of Clustering Algorithms for Time-Series Data with Dimension Reduction
Abstract
Clustering is an attempt to form groups of similar objects, and it is a powerful tool for discovering valuable underlying patterns in data. When clustering high-dimensional data, algorithms can suffer from the curse of dimensionality: the data becomes sparse as the number of dimensions grows, which can lead to poor clustering performance. Dimensionality reduction methods (DRMs) are designed to help alleviate this issue. A time-series is a temporally ordered set of points, and each consecutive point in time can be considered a dimension, so time-series count as high-dimensional data. Time-Series K-Means (TSK-Means) with Dynamic Time Warping (DTW) is an algorithm that has proven successful for clustering time-series. However, TSK-Means is computationally complex and might require substantial training time due to the potentially high dimensionality of time-series.

This thesis studies the clustering of time-series data provided by temperature sensors installed in refrigerators, and tries to make it less computationally complex by using the DRMs Principal Component Analysis (PCA), Time-Series Autoencoder (TSA), and Self-Organizing Maps (SOM). We use these methods in combination with three clustering algorithms, namely K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Agglomerative Hierarchical Clustering (AHC), to potentially find valuable patterns in the provided data. The clusters and patterns were evaluated on a theoretical and practical level regarding the application of pattern recognition and detection in the domain of refrigerator temperature monitoring and logging. This is an effort to improve refrigerator maintenance and quality assurance, deviation management, and to potentially reduce food loss.

The results indicate that TSK-Means outperforms any other combination of DRMs and clustering algorithms when it comes to detecting patterns in the data, despite being more computationally complex. Regardless, the use of DRMs simplified the clustering process of time-series, and allowed the K-Means algorithm to detect patterns more efficiently than the TSK-Means algorithm. The clusters and patterns that were discovered seem promising for the application of deviation management and refrigerator quality assurance.
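
To illustrate why DTW suits time-series and why it is costly, the following is a minimal sketch of the textbook DTW distance in plain Python. It is not the implementation used in this thesis: the function names `dtw_distance` and `euclidean_distance`, the absolute-difference local cost, and the toy temperature readings are all assumptions made for illustration.

```python
from math import inf, sqrt


def dtw_distance(a, b):
    """Textbook DTW distance between two numeric sequences.

    Assumption: the local cost is the absolute difference between points.
    Runs in O(len(a) * len(b)) time, which is the source of the
    computational cost for long series.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance for equal-length sequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical readings: the same temperature spike, shifted by one time step.
series_a = [4, 4, 8, 4, 4]
series_b = [4, 4, 4, 8, 4]

shifted_dtw = dtw_distance(series_a, series_b)
shifted_euclidean = euclidean_distance(series_a, series_b)
```

In this toy case DTW aligns the two spikes and treats the series as identical, whereas the Euclidean distance penalizes the shift. The quadratic cost table per pair of series is what makes DTW-based clustering expensive on long series.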