Clustering with Transitive Distance and K-Means Duality

Chunjing Xu, Jianzhuang Liu, Xiaoou Tang
DOI: https://doi.org/10.48550/arXiv.0711.3594
2007-11-22
Abstract: Recent spectral clustering methods are a popular and powerful technique for data clustering. These methods need to solve an eigenproblem whose computational complexity is $O(n^3)$, where $n$ is the number of data samples. In this paper, a non-eigenproblem based clustering method is proposed to deal with the clustering problem. Its performance is comparable to that of the spectral clustering algorithms, but it is more efficient, with computational complexity $O(n^2)$. We show that with a transitive distance and an observed property, called K-means duality, our algorithm can be used to handle data sets with complex cluster shapes, multi-scale clusters, and noise. Moreover, no parameters except the number of clusters need to be set in our algorithm.
Machine Learning
What problem does this paper attempt to address?
The main problems that this paper attempts to solve are the efficiency and performance issues of existing clustering algorithms on datasets with complex cluster shapes, multiple scales, and noise. Specifically:

1. **High computational complexity**: Existing spectral clustering methods need to solve an eigenvalue problem, whose computational complexity is \(O(n^3)\). For large datasets this cost becomes prohibitive.
2. **Sensitivity to parameters**: Many clustering algorithms (such as K-means and EM) assume that the data has a certain underlying structure (for example, hyper-ellipsoidal clusters or a Gaussian distribution) and need parameter tuning to obtain good results.
3. **Difficulty in handling clusters of complex shapes**: Traditional methods perform poorly on clusters of complex shapes or on multi-scale clusters.

To overcome these problems, the authors propose a clustering method that is based on transitive distance and K-means duality and avoids solving an eigenvalue problem. The main contributions of this method include:

- **Introduction of transitive distance**: The transitive distance better reflects the actual relationship between samples, so that clusters of complex shapes become more compact in the new space.
- **K-means duality**: Using K-means duality, clustering can be performed directly on the distance matrix without relying on coordinates.
- **Low computational complexity**: The computational complexity of the new algorithm is \(O(n^2)\), which is more efficient than the \(O(n^3)\) of spectral clustering methods.
- **No need to adjust parameters**: Apart from the number of clusters, no parameters need to be set, which simplifies use.
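The two ingredients named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the transitive distance between two samples is the minimax-path distance (the smallest achievable value of the largest hop over all paths joining them), and that K-means duality amounts to running ordinary K-means on the rows of the resulting distance matrix. The function names `transitive_distance` and `kmeans_rows`, the Euclidean base metric, and the farthest-point seeding are all choices made for this example.

```python
import numpy as np


def transitive_distance(points):
    """Minimax-path ("transitive") distance between all pairs of points.

    T[i, j] is the smallest possible value of the largest hop on any path
    from i to j. It is read off a minimum spanning tree grown with Prim's
    algorithm, so the whole computation takes O(n^2) time.
    """
    X = np.asarray(points, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances (the base metric is an assumption).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    T = np.zeros((n, n))
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()               # cheapest edge from the tree to each vertex
    parent = np.zeros(n, dtype=int)  # tree endpoint of that cheapest edge
    for _ in range(n - 1):
        v = int(np.argmin(np.where(in_tree, np.inf, best)))
        p, w = parent[v], best[v]
        members = np.flatnonzero(in_tree)
        # Any tree vertex u reaches v via its tree path to p plus edge (p, v).
        T[members, v] = np.maximum(T[members, p], w)
        T[v, members] = T[members, v]
        in_tree[v] = True
        closer = ~in_tree & (D[v] < best)
        best[closer] = D[v][closer]
        parent[closer] = v
    return T


def kmeans_rows(M, k, n_iter=100):
    """Plain Lloyd's K-means on the rows of M with farthest-point seeding."""
    M = np.asarray(M, dtype=float)
    centers = [M[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(M - c, axis=1) for c in centers], axis=0)
        centers.append(M[int(np.argmax(d))])
    centers = np.array(centers)
    labels = np.full(len(M), -1)
    for _ in range(n_iter):
        dist = np.linalg.norm(M[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = np.argmin(dist, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = M[labels == j].mean(axis=0)
    return labels
```

To cluster a dataset `X` of shape `(n, d)` into `k` groups with this sketch, one would call `kmeans_rows(transitive_distance(X), k)`. Because a chain of closely spaced samples has a small largest hop, elongated or curved clusters end up with small mutual transitive distances, which is why rows of the matrix are easier for K-means to separate than the raw coordinates.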
In summary, this paper aims to propose an efficient and robust clustering method that reduces computational complexity from \(O(n^3)\) to \(O(n^2)\) while maintaining performance comparable to spectral clustering algorithms, and that can handle datasets with complex cluster shapes, multiple scales, and noise.