Abstract: The K-Means clustering algorithm is one of the most commonly used clustering algorithms because of its simplicity and efficiency. K-Means based on Euclidean distance considers only the straight-line distance between samples and ignores the overall distribution structure of the dataset (i.e., its manifold structure). Because Euclidean distance struggles to describe the internal structure between two data points in high-dimensional data space, we propose a new distance measure, the view-distance, and apply it to the K-Means algorithm. On the classical manifold learning datasets S-curve and Swiss roll, the new distance not only clusters the data according to the structure of the data itself, but also yields clean dividing lines between categories. We also tested the classification accuracy and clustering quality of K-Means based on view-distance on several real-world datasets. The experimental results show that, on most datasets, K-Means based on view-distance improves both classification accuracy and clustering quality to some degree.

What problem does this paper attempt to address?

The problem this paper addresses is that traditional K-Means, when it relies only on Euclidean distance in high-dimensional data space, cannot fully describe the internal structure and manifold structure between data points. Specifically:

1. **Limitations of Euclidean distance**: Traditional K-Means is based on Euclidean distance. It considers only the straight-line distance between samples and ignores the overall distribution structure of the dataset (i.e., the manifold structure of the data). In high-dimensional data space, Euclidean distance struggles to describe the internal relationship between two data points accurately.
2. **Improving clustering effect**: To overcome this limitation, the authors propose a new distance measure, the view-distance, and apply it to the K-Means algorithm. The view-distance considers not only the Euclidean distance between samples but also their projection distances on different hyperplanes, so it captures the manifold structure of the data better.
3. **Experimental verification**: Experiments on classical manifold learning datasets (such as S-curve and Swiss roll) and on several real-world datasets show that K-Means based on view-distance improves both classification accuracy and clustering effect.

### Formula representation

The view-distance is defined as follows. Given a sample set \(X\subset\mathbb{R}^m\), for any two samples \(x=(x_1,x_2,\ldots,x_m)^T\in X\) and \(y=(y_1,y_2,\ldots,y_m)^T\in X\), the view-distance \(d_v\) is defined as:

\[ d_v=\frac{(m - 2)!}{2}\sum_{1\leq i<j\leq m}d_E((x_i,x_j),(y_i,y_j)) \]

where \(d_E((x_i,x_j),(y_i,y_j))\) is the Euclidean distance between the vectors \((x_i,x_j)\) and \((y_i,y_j)\).

The simplified calculation formula drops the constant factor \(\frac{(m - 2)!}{2}\), which scales every distance equally and therefore does not change which centroid is nearest:

\[ d_v=\sum_{1\leq i<j\leq m}d_E((x_i,x_j),(y_i,y_j)) \]

### Experimental results

The experimental results show that K-Means based on view-distance clusters the S-curve and Swiss roll datasets according to the manifold structure of the data, and that the boundaries between categories are more distinct. On multiple real-world datasets, it also improves classification accuracy and clustering effect.

### Summary

By introducing the view-distance as a new distance measure, this paper addresses the inability of traditional K-Means to describe the internal structure of data in high-dimensional data space, and thereby improves clustering effect and accuracy.
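To make the simplified view-distance formula concrete, the following is a minimal Python sketch, not the authors' implementation. The function names `view_distance`, `pairwise_view_distance`, and `kmeans_view` are hypothetical, as are the `init`, `n_iter`, and `seed` parameters. The summary does not say how centroids are updated under the view-distance, so the sketch assumes the ordinary coordinate-wise mean used by standard K-Means.

```python
from itertools import combinations

import numpy as np


def view_distance(x, y):
    """Simplified view-distance: sum over all coordinate pairs i < j of the
    Euclidean distance between (x_i, x_j) and (y_i, y_j)."""
    diff_sq = (np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2
    return float(sum(np.sqrt(diff_sq[i] + diff_sq[j])
                     for i, j in combinations(range(len(diff_sq)), 2)))


def pairwise_view_distance(X, C):
    """Vectorized view-distance between every row of X (n, m) and C (k, m).
    Returns an (n, k) array."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    diff_sq = (X[:, None, :] - C[None, :, :]) ** 2   # shape (n, k, m)
    i, j = np.triu_indices(X.shape[1], k=1)          # all index pairs i < j
    return np.sqrt(diff_sq[..., i] + diff_sq[..., j]).sum(axis=-1)


def kmeans_view(X, k, init=None, n_iter=100, seed=0):
    """K-Means loop that assigns points by view-distance.

    Assumption (not from the source): centroids are updated with the
    coordinate-wise mean, as in standard K-Means.
    """
    X = np.asarray(X, dtype=float)
    if init is None:
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centers = np.asarray(init, dtype=float).copy()

    for _ in range(n_iter):
        # Assignment step: nearest centroid under the view-distance.
        labels = pairwise_view_distance(X, centers).argmin(axis=1)
        # Update step: mean of assigned points; keep an empty cluster's centroid.
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Final labels consistent with the returned centroids.
    labels = pairwise_view_distance(X, centers).argmin(axis=1)
    return labels, centers
```

For two-dimensional data there is only one coordinate pair, so this distance reduces to the ordinary Euclidean distance. Calling `kmeans_view(X, k)` on an `(n, m)` array returns one label per sample together with the final centroids.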

A new distance measurement and its application in K-Means Algorithm

Rethinking k-means from manifold learning perspective

A Novel Effective Distance Measure and a Relevant Algorithm for Optimizing the Initial Cluster Centroids of K-means

Sub-One Quasi-Norm-Based k-Means Clustering Algorithm and Analyses

Subspace Clustering by Directly Solving Discriminative K-means

Data Clustering: Integrating Different Distance Measures with Modified k-Means Algorithm

An Investigation into Distance Measures in Cluster Analysis

An Improved K-Means Algorithm Based on Evidence Distance

A modified k-means clustering with a density-sensitive distance metric

Intrinsic K-means clustering over homogeneous manifolds

An Effective and Efficient Algorithm for K-means Clustering with New Formulation

Speeding Up K-Means Clustering in High Dimensions by Pruning Unnecessary Distance Computations

An Improved K-means Algorithm Based on Multiple Clustering and Density.

A Generalization of Proximity Functions for K-Means

A genetic algorithm based clustering using geodesic distance measure

An Iterative Optimization Clustering Algorithm Based on Manifold Distance

Analysis of Euclidean Distance and Manhattan Distance in the K-Means Algorithm for Variations Number of Centroid K

DSKmeans: A new kmeans-type approach to discriminative subspace clustering

Improvement Study and Application Based on K-Means Clustering Algorithm

K-groups: A Generalization of K-means Clustering

A Novel Graph-Based K-Means for Nonlinear Manifold Clustering and Representative Selection.