A Novel Effective Distance Measure and a Relevant Algorithm for Optimizing the Initial Cluster Centroids of K-means

Yang Liu, Shuaifeng Ma, Xinxin Du
DOI: https://doi.org/10.1109/access.2020.3044069
IF: 3.9
2021-01-01
IEEE Access
Abstract: The traditional K-means algorithm is very sensitive to the selection of the initial clustering point and the calculation of the distance measure, which is likely to result in the convergence of only partly optimal solutions. An improved k-means algorithm is proposed to solve the problem of unbalanced clustering effect caused by the fact that the first initial clustering centre falls in the non-dense region of the boundary in the initial clustering centre optimisation process. An improved k-means algorithm for initial clustering centres is proposed, namely, the optimal matching algorithm for K-means clustering, and related experimental analysis of the algorithm is carried out. The improved algorithm first selects the initial points of the traditional K-means clustering algorithm and analyses the clustering results. Then, the initial clustering centre selection and distance determination were tested and the clustering effect was evaluated by introducing the contour coefficient. Experiments on both artificial data sets and UCI data sets show that the algorithm can achieve better clustering results. The experimental results indicate that the improved algorithm has a much higher clustering quality than the traditional K-means algorithm and other improved algorithms.
computer science, information systems; telecommunications; engineering, electrical & electronic
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the sensitivity of the traditional K-means algorithm to the selection of initial cluster centers and to the calculation of the distance metric, which may cause the algorithm to converge only to a local optimum. Specifically, when an initial cluster center falls in a non-dense area of the boundary region, the clustering result becomes unbalanced. The paper therefore proposes an improved K-means algorithm that optimizes the selection of initial cluster centers, with the aim of improving the quality and stability of clustering.

### Overview of the Improved K-means Algorithm

1. **Selection of Initial Cluster Centers**:
   - The traditional K-means algorithm is very sensitive to the selection of initial cluster centers and is prone to getting stuck in local optima.
   - The paper proposes a new method for selecting initial cluster centers, namely the Optimal Matching Algorithm for K-means Clustering. This method first selects the initial points of the traditional K-means algorithm and then analyzes the clustering results.
2. **Distance Metric**:
   - The traditional K-means algorithm usually uses the Euclidean distance to calculate the distance between data points, but this measure may not be suitable for all types of data sets.
   - The paper tests the initial cluster center selection and the distance measure, and introduces the Silhouette Coefficient to evaluate the resulting clustering effect.
3. **Experimental Verification**:
   - The paper conducts experiments on artificial data sets and UCI standard data sets to verify the effectiveness of the improved algorithm.
   - The reported results show that the improved K-means algorithm achieves much higher clustering quality than the traditional K-means algorithm and other improved algorithms.

### Formula Representation

1. **Euclidean Distance**:
   - For two points \(x = (x_1, x_2, \ldots, x_n)\) and \(y = (y_1, y_2, \ldots, y_n)\), the Euclidean distance \(d(x, y)\) can be expressed as:
     \[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
2. **Silhouette Coefficient**:
   - The Silhouette Coefficient \(s(i)\) is used to evaluate the clustering effect of a single sample \(i\), and is calculated as follows:
     - Calculate the average distance \(a(i)\) from sample \(i\) to the other samples in the same cluster.
     - For each other cluster, calculate the average distance from sample \(i\) to the samples in that cluster; \(b(i)\) is the minimum of these averages.
     - The Silhouette Coefficient \(s(i)\) is defined as:
       \[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]
   - The Silhouette Coefficient takes values in \([-1, 1]\); the closer the value is to 1, the better the clustering effect.

### Conclusion

The paper addresses the shortcomings of the traditional K-means algorithm by improving the selection of initial cluster centers and by using the Silhouette Coefficient to evaluate the clustering effect, with the aim of improving the quality and stability of clustering. According to the reported experiments, the improved algorithm performs well on multiple data sets.
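
The Euclidean distance and Silhouette Coefficient formulas given under Formula Representation can be sketched in a few lines of Python. This is a minimal illustration of the two formulas only, not the paper's optimal matching algorithm; the function names (`euclidean`, `silhouette`, `mean_silhouette`), the toy points, and the convention that a sample alone in its cluster gets \(s(i) = 0\) are assumptions made for this example.

```python
import math


def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) for two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def silhouette(points, labels, i):
    """Silhouette Coefficient s(i) of sample i under the given cluster labels."""
    own = labels[i]
    # Distances from sample i to every other sample, grouped by cluster label.
    dists = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            dists.setdefault(lab, []).append(euclidean(points[i], p))
    # Average distance to each of the other clusters.
    others = [sum(d) / len(d) for lab, d in dists.items() if lab != own]
    if not others:
        raise ValueError("need at least two clusters")
    if own not in dists:
        return 0.0  # assumed convention: a singleton cluster gets s(i) = 0
    a = sum(dists[own]) / len(dists[own])  # a(i): mean distance within own cluster
    b = min(others)                        # b(i): smallest mean distance to another cluster
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def mean_silhouette(points, labels):
    """Average of s(i) over all samples, a single score for a whole clustering."""
    return sum(silhouette(points, labels, i) for i in range(len(points))) / len(points)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good_labels = [0, 0, 1, 1]  # each pair forms its own cluster
bad_labels = [0, 1, 0, 1]   # each cluster mixes the two pairs

print("good clustering:", mean_silhouette(points, good_labels))
print("bad clustering: ", mean_silhouette(points, bad_labels))
```

On this toy data the labelling that keeps each pair together scores higher than the one that mixes the pairs, which is the sense in which a value closer to 1 indicates a better clustering.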