Evaluation of k-means time series clustering based on z-normalization and NP-Free

Ming-Chang Lee, Jia-Chun Lin, Volker Stolz
2024-01-29
Abstract: Despite the widespread use of k-means time series clustering in various domains, there exists a gap in the literature regarding its comprehensive evaluation with different time series normalization approaches. This paper seeks to fill this gap by conducting a thorough performance evaluation of k-means time series clustering on real-world open-source time series datasets. The evaluation focuses on two distinct normalization techniques: z-normalization and NP-Free. The former is one of the most commonly used normalization approaches for time series. The latter is a real-time time series representation approach, which can serve as a time series normalization approach. The primary objective of this paper is to assess the impact of these two normalization techniques on the clustering quality of k-means time series clustering. The experiments employ the silhouette score, a well-established metric for evaluating the quality of clusters in a dataset. By systematically investigating the performance of k-means time series clustering with these two normalization techniques, this paper addresses the current gap in k-means time series clustering evaluation and contributes valuable insights to the development of time series clustering.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper evaluates how different time-series normalization techniques, specifically z-normalization and NP-Free, affect the performance of k-means time-series clustering, measured in terms of clustering quality. It aims to fill a gap in the existing literature: k-means time-series clustering has not been comprehensively evaluated under different normalization methods.

### Main problems of the paper

1. **Evaluation of k-means time-series clustering**: Although k-means time-series clustering is widely used in multiple fields, it has not been comprehensively evaluated under different normalization methods.
2. **Selection of normalization techniques**: The paper evaluates two normalization techniques: z-normalization and NP-Free. Z-normalization is one of the most commonly used normalization methods. NP-Free is a real-time time-series representation method that can also serve as a normalization method.
3. **Evaluation of clustering quality**: The silhouette score is used as the evaluation metric. It measures how similar each data point is to its own cluster compared with the other clusters. It ranges from -1 to 1, and a value closer to 1 indicates better clustering.

### Experimental design

- **Data set**: Two real-world open-source time-series data sets from the UEA&UCR repository are used: GunPointPointTrain and GunPointMaleTrain.
- **Experimental method**:
  - Apply z-normalization-based k-means clustering (denoted as z-kmeans) and NP-Free-based k-means clustering (denoted as NPF-kmeans) to each data set.
  - Vary the parameter k (the number of clusters), calculate the average silhouette score of each method for each k, and compare the clustering quality of the two methods.

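
The evaluation loop described above (cluster with k-means, vary k, compare average silhouette scores) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names `euclidean`, `kmeans`, and `mean_silhouette`, the deterministic seeding, the use of Euclidean distance, and the toy data are all assumptions made for the example. The series are assumed to be already normalized and of equal length; NP-Free itself is not implemented here because the summary does not describe its steps.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(series, k, max_iters=100):
    """Lloyd's k-means; returns one cluster label per series."""
    # Assumption: deterministic seeding with k evenly spaced input series.
    centroids = [list(series[(i * len(series)) // k]) for i in range(k)]
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda c: euclidean(s, centroids[c]))
            for s in series
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def mean_silhouette(series, labels):
    """Average silhouette score over all series (range -1 to 1)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:  # convention: a singleton cluster scores 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(series[i], series[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(series[i], series[j]) for j in idxs) / len(idxs)
            for other, idxs in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Made-up toy data: two well-separated groups of short series.
toy = [
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0], [5.0, 5.1, 5.0],
]
for k in (2, 3):
    print(k, round(mean_silhouette(toy, kmeans(toy, k)), 3))
```

In a comparison like the one in the paper, the same loop would be run once on the z-normalized series and once on the NP-Free representation, and the per-k scores compared side by side.
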
### Main findings

- For all tested k values, the silhouette score of NPF-kmeans is higher than that of z-kmeans, indicating that NP-Free provides better clustering quality than z-normalization on these data sets.
- When the GunPointPointTrain data set is divided into 15 clusters, both methods reach their highest silhouette score, and the performance of NPF-kmeans is particularly outstanding.

### Conclusion

This research provides valuable insights for the research and development of time-series clustering by systematically evaluating the impact of different normalization techniques on k-means time-series clustering. In particular, NP-Free, as a new normalization method, performs better than the traditional z-normalization method in some cases, which provides new ideas for further improving time-series clustering algorithms.

### Formula presentation

- The formula for z-normalization is:

  \[ Z_i = \frac{x_i - \mu}{\sigma} \]

  where \( x_i \) is the \( i \)-th data point in the time series, \( \mu \) is the mean of all data points, and \( \sigma \) is the standard deviation of all data points.
- The formula NP-Free uses to calculate the RMSE is:

  \[ RMSE_t = \sqrt{\frac{1}{3}\sum_{i = t - 2}^{t}(c_i - \hat{c}_i)^2} \]

  where \( c_i \) is the actual data point and \( \hat{c}_i \) is the predicted data point.

Through these detailed analyses and evaluations, the paper provides important references and guidance for the research of time-series clustering.
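
The two formulas in the Formula presentation section translate directly into code. The sketch below is illustrative only: the names `z_normalize` and `windowed_rmse` are hypothetical, the population standard deviation (dividing by n) is an assumption because the summary does not say which variant is used, and the predicted values are placeholders since the summary does not describe how NP-Free produces them.

```python
import math

def z_normalize(series):
    """Apply Z_i = (x_i - mu) / sigma to every point of one series."""
    n = len(series)
    mu = sum(series) / n
    # Assumption: population standard deviation (divide by n).
    sigma = math.sqrt(sum((x - mu) ** 2 for x in series) / n)
    if sigma == 0:  # constant series: avoid division by zero
        return [0.0] * n
    return [(x - mu) / sigma for x in series]

def windowed_rmse(actual, predicted, t):
    """RMSE_t over the three points at indices t-2, t-1, t (0-based)."""
    if t < 2 or t >= len(actual) or t >= len(predicted):
        raise ValueError("t must leave room for a full three-point window")
    squared = sum((actual[i] - predicted[i]) ** 2 for i in range(t - 2, t + 1))
    return math.sqrt(squared / 3)

# Made-up values, for illustration only.
print(z_normalize([1.0, 2.0, 3.0]))
print(windowed_rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0], t=3))
```

After z-normalization a series has mean 0 and standard deviation 1, so series with different offsets and amplitudes become comparable by shape before clustering.
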