Hierarchical Clustering using Auto-encoded Compact Representation for Time-series Analysis

Soma Bandyopadhyay, Anish Datta, Arpan Pal
DOI: https://doi.org/10.48550/arXiv.2101.03742
2021-01-11
Abstract: Getting a robust time-series clustering with best choice of distance measure and appropriate representation is always a challenge. We propose a novel mechanism to identify the clusters combining learned compact representation of time-series, Auto Encoded Compact Sequence (AECS) and hierarchical clustering approach. Proposed algorithm aims to address the large computing time issue of hierarchical clustering as learned latent representation AECS has a length much less than the original length of time-series and at the same time want to enhance its performance. Our algorithm exploits Recurrent Neural Network (RNN) based under complete Sequence to Sequence (seq2seq) autoencoder and agglomerative hierarchical clustering with a choice of best distance measure to recommend the best clustering. Our scheme selects the best distance measure and corresponding clustering for both univariate and multivariate time-series. We have experimented with real-world time-series from UCR and UCI archive taken from diverse application domains like health, smart-city, manufacturing etc. Experimental results show that proposed method not only produce close to benchmark results but also in some cases outperform the benchmark.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to achieve robust time-series clustering while selecting the best distance measure and an appropriate time-series representation. Specifically, the authors propose a mechanism that combines a learned compact time-series representation (Auto Encoded Compact Sequence, AECS) with hierarchical clustering, in order to reduce the long computing time of traditional hierarchical clustering and to improve clustering performance.

### Specific description of the problem

1. **Robust time-series clustering**:
   - Time-series data usually lacks labels, and annotation by domain experts is costly. Effective unsupervised learning methods are therefore required to discover patterns, groups, and subgroups.
   - In fields such as healthcare, manufacturing, and smart cities, the complexity and diversity of time-series data make clustering more difficult.
2. **Selecting the best distance measure**:
   - Different distance measures (such as Chebyshev, Manhattan, and Mahalanobis distance) affect the clustering results differently, so selecting an appropriate one is crucial.
   - An internal clustering validation measure is required to evaluate and select the best distance measure.
3. **Efficient time-series representation**:
   - Traditional hierarchical clustering has high computational overhead on long sequences, which makes it inefficient.
   - A compact time-series representation is required that reduces computing time while retaining the important features of the time-series.

### Solution

The solution proposed by the authors has the following parts:

1. **Auto Encoded Compact Sequence (AECS)**:
   - Use a seq2seq LSTM autoencoder to learn a compact representation of the time-series. The resulting latent representation is much shorter than the original time-series.
   - This compact representation reduces computing time while capturing the key features of the time-series.
2. **Hierarchical clustering**:
   - Apply agglomerative hierarchical clustering to the AECS.
   - Use average linkage to measure the similarity between clusters, which allows non-convex clusters to form, as required by various practical applications.
3. **Distance measure selection**:
   - Compare three distance measures: Chebyshev, Manhattan, and Mahalanobis distance.
   - Use the Modified Hubert Statistic (T) as an internal clustering validation index and select the clustering result with the highest T value.
4. **Extensive experimental verification**:
   - Conduct experiments on multiple univariate and multivariate time-series datasets from the UCR Time Series Classification Archive and the UCI Machine Learning Repository.
   - The experimental results show that the method produces results close to the benchmark and, in some cases, outperforms it.

Together, these components target both the computational cost and the clustering quality of time-series clustering.
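The clustering and distance-selection steps can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation and rests on several assumptions: the `latent` vectors are hypothetical stand-ins for AECS (no autoencoder is trained here), Mahalanobis distance is omitted because it needs an inverse covariance matrix, and T is taken in its commonly used form, the mean over all pairs of the distance between two instances multiplied by the distance between the centres of their clusters. To keep T comparable across candidates, the sketch evaluates it with one fixed reference distance, which is also an assumption. All function names are illustrative.

```python
from itertools import combinations


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_link(points, ca, cb, dist):
    """Average pairwise distance between the members of two clusters."""
    total = sum(dist(points[i], points[j]) for i in ca for j in cb)
    return total / (len(ca) * len(cb))


def average_linkage(points, k, dist):
    """Agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(points, clusters[p[0]], clusters[p[1]], dist),
        )
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def modified_hubert(points, labels, dist):
    """T = 2 / (n (n - 1)) * sum over pairs of d(x_i, x_j) * d(c_i, c_j)."""
    n = len(points)
    centres = {}
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centres[lab] = [sum(col) / len(members) for col in zip(*members)]
    total = sum(
        dist(points[i], points[j]) * dist(centres[labels[i]], centres[labels[j]])
        for i, j in combinations(range(n), 2)
    )
    return 2.0 * total / (n * (n - 1))


def select_best(points, k, candidates, reference):
    """Cluster once per candidate distance and keep the clustering with highest T."""
    scored = {}
    for name, dist in candidates.items():
        labels = average_linkage(points, k, dist)
        scored[name] = (modified_hubert(points, labels, reference), labels)
    best = max(scored, key=lambda name: scored[name][0])
    return best, scored[best][1], scored[best][0]


# Hypothetical 2-dimensional latent vectors standing in for AECS.
latent = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
candidates = {"chebyshev": chebyshev, "manhattan": manhattan}
best_name, best_labels, best_t = select_best(latent, 2, candidates, reference=manhattan)
print(best_name, best_labels, round(best_t, 3))
```

The naive merge loop recomputes every cluster-to-cluster distance at each step, so it is only suitable for small inputs. For real data a library routine such as SciPy's hierarchical clustering would be the more practical choice.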