Abstract: The K-Means clustering algorithm is one of the most commonly used clustering algorithms because of its simplicity and efficiency. K-Means based on Euclidean distance considers only the straight-line distance between samples and ignores the overall distribution structure of the dataset (i.e., its manifold structure). Because Euclidean distance struggles to describe the internal structure between two data points in high-dimensional data space, we propose a new distance measure, the view-distance, and apply it to the K-Means algorithm. On the classical manifold learning datasets S-curve and Swiss roll, the new distance not only clusters the data according to the structure of the data itself but also yields neat dividing lines between categories. We also tested the classification accuracy and clustering effect of K-Means based on view-distance on several real-world datasets. The experimental results show that, on most datasets, K-Means based on view-distance achieves some improvement in classification accuracy and clustering effect.
What problem does this paper attempt to address?
This paper addresses the following problem: when the traditional K-Means clustering algorithm relies only on Euclidean distance in high-dimensional data space, it cannot fully describe the internal structure and manifold structure between data points. Specifically:
1. **Limitations of Euclidean distance**: The traditional K-Means algorithm is based on Euclidean distance. It considers only the straight-line distance between samples and ignores the overall distribution structure of the dataset (i.e., the manifold structure of the data). In high-dimensional data space, Euclidean distance struggles to describe the internal relationship between two data points accurately.
2. **Improving clustering effect**: To overcome these problems, the author proposes a new distance measure, the view-distance, and applies it to the K-Means algorithm. The view-distance considers not only the Euclidean distance between samples but also their projection distances on different hyperplanes, thereby better capturing the manifold structure of the data.
3. **Experimental verification**: Through experiments on classical manifold learning datasets (such as S-curve and Swiss roll) and several real-world datasets, the author verifies that K-Means based on view-distance improves both classification accuracy and clustering effect.
### Formula representation
The view-distance is defined as follows:
Given a sample set \(X\subset\mathbb{R}^m\), for any two samples \(x=(x_1,x_2,\ldots,x_m)^T\in X\) and \(y=(y_1,y_2,\ldots,y_m)^T\in X\), the view-distance \(d_v\) is defined as:
\[
d_v=\frac{(m - 2)!}{2}\sum_{1\leq i<j\leq m}d_E((x_i,x_j),(y_i,y_j))
\]
where \(d_E((x_i,x_j),(y_i,y_j))\) represents the Euclidean distance between the vectors \((x_i,x_j)\) and \((y_i,y_j)\).
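The leading factor \(\frac{(m - 2)!}{2}\) depends only on the dimension \(m\), so it is the same for every pair of samples and does not change which of two distances is smaller. The simplified calculation formula therefore drops it: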
\[
d_v=\sum_{1\leq i<j\leq m}d_E((x_i,x_j),(y_i,y_j))
\]
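As a minimal sketch of the simplified formula above, the following pure-Python function sums the two-dimensional Euclidean distances over every coordinate pair \((i, j)\) with \(i<j\). The function name `view_distance` is an assumption made for illustration and is not taken from the paper.

```python
import math
from itertools import combinations


def view_distance(x, y):
    """Simplified view-distance: sum over all coordinate pairs i < j of the
    Euclidean distance between (x_i, x_j) and (y_i, y_j)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    return sum(
        # math.hypot(a, b) computes sqrt(a*a + b*b), the 2-D Euclidean norm.
        math.hypot(x[i] - y[i], x[j] - y[j])
        for i, j in combinations(range(len(x)), 2)
    )


# Worked example in three dimensions: the pairs (1,2), (1,3), (2,3)
# contribute 5, 3, and 4 respectively.
print(view_distance([0, 0, 0], [3, 4, 0]))  # 12.0
```

For \(m=2\) there is a single coordinate pair, so this simplified view-distance coincides with the ordinary Euclidean distance. The number of pairs grows as \(m(m-1)/2\), which is worth keeping in mind for high-dimensional data.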
### Experimental results
The experimental results show that K-Means based on view-distance can cluster according to the manifold structure of the data on the S-curve and Swiss roll datasets, and that the boundaries between categories are more distinct. In addition, on multiple real-world datasets, K-Means based on view-distance also improves classification accuracy and clustering effect.
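To make the clustering procedure behind these experiments concrete, here is a rough sketch of a Lloyd-style K-Means loop that assigns each point to its nearest centroid under the simplified view-distance. This summary does not say how the paper initializes or updates centroids, so the caller-supplied initial centroids and the coordinate-wise mean update are assumptions, as are the names `view_distance` and `kmeans_view`.

```python
import math
from itertools import combinations


def view_distance(x, y):
    """Simplified view-distance: sum of 2-D Euclidean distances over all
    coordinate pairs i < j."""
    return sum(
        math.hypot(x[i] - y[i], x[j] - y[j])
        for i, j in combinations(range(len(x)), 2)
    )


def kmeans_view(points, init_centroids, n_iter=100):
    """Lloyd-style K-Means using view-distance in the assignment step.

    Assumption: centroids are updated with the coordinate-wise mean, as in
    standard K-Means; the paper may use a different update rule.
    """
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid under view-distance.
        new_labels = [
            min(range(k), key=lambda c: view_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
        # Update step: coordinate-wise mean of each non-empty cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                centroids[c] = [
                    sum(p[d] for p in members) / len(members)
                    for d in range(dim)
                ]
    return labels, centroids


# Toy data: two well-separated groups in three dimensions.
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0),
          (10, 10, 10), (10.1, 10, 10), (10, 10.1, 10)]
labels, centroids = kmeans_view(points, [points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The toy data only checks that the loop runs and separates two obvious groups; it does not reproduce the S-curve or Swiss roll experiments.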
### Summary
By introducing the view-distance as a new distance measure, this paper addresses the traditional K-Means algorithm's inability to describe the internal structure of data effectively in high-dimensional data space, and reports improved clustering effect and accuracy on most of the datasets tested.