Nonlinear Markov Clustering by Minimum Curvilinear Sparse Similarity

C. Duran, A. Acevedo, S. Ciucci, A. Muscoloni, CV. Cannistraci
DOI: https://doi.org/10.48550/arXiv.1912.12211
2019-12-28
Abstract: The development of algorithms for unsupervised pattern recognition by nonlinear clustering is a notable problem in data science. Markov clustering (MCL) is a renowned algorithm that simulates stochastic flows on a network of sample similarities to detect the structural organization of clusters in the data, but it has never been generalized to deal with data nonlinearity. Minimum Curvilinearity (MC) is a principle that approximates nonlinear sample distances in the high-dimensional feature space by curvilinear distances, which are computed as transversal paths over their minimum spanning tree, and then stored in a kernel. Here we propose MC-MCL, which is the first nonlinear kernel extension of MCL and exploits Minimum Curvilinearity to enhance the performance of MCL in real and synthetic data with underlying nonlinear patterns. MC-MCL is compared with baseline clustering methods, including DBSCAN, K-means and affinity propagation. We find that Minimum Curvilinearity provides a valuable framework to estimate nonlinear distances also when its kernel is applied in combination with MCL. Indeed, MC-MCL overcomes classical MCL and even baseline clustering algorithms in different nonlinear datasets.
Machine Learning
What problem does this paper attempt to address?
The paper addresses a limitation of the Markov clustering (MCL) algorithm when it is applied to nonlinear data. MCL detects the structural organization of clusters by simulating stochastic flows on a network of sample similarities, but it has never been generalized to handle data nonlinearity. To close this gap, the authors combine MCL with the Minimum Curvilinearity (MC) principle and propose the MC-MCL algorithm.

### Main problems:

1. **Nonlinear data processing**: Classical MCL cannot effectively process data with nonlinear patterns.
2. **Improving clustering performance**: A new method is required to enhance the performance of MCL on real and synthetic datasets, especially when the data present nonlinear patterns.

### Solutions:

- **Minimum Curvilinearity (MC) principle**: MC approximates nonlinear sample distances in the high-dimensional feature space by curvilinear distances, which are computed as transversal paths over the minimum spanning tree (MST) of the samples. These curvilinear distances are then stored in a kernel.
- **MC-MCL algorithm**: Applying the MC kernel to MCL yields a nonlinear kernel extension of MCL, called MC-MCL. The algorithm uses MC to enhance the performance of MCL on nonlinear data.

### Experimental verification:

To assess MC-MCL, the authors compare it with baseline clustering methods, including DBSCAN, K-means and affinity propagation. According to the reported results, MC-MCL outperforms classical MCL, and even the baseline clustering algorithms, on several nonlinear datasets.
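As a rough illustration of the MC principle described under Solutions, the sketch below computes pairwise path lengths over the minimum spanning tree of a small point set. It is a minimal sketch and not the authors' implementation: the function name `mc_distances`, the use of SciPy, and the choice of Euclidean distance as the starting metric are all assumptions made here for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform


def mc_distances(X):
    """Hypothetical helper: pairwise path lengths over the MST of the samples.

    Assumes no duplicate samples, because SciPy treats zero entries of a
    dense distance matrix as missing edges.
    """
    # Dense Euclidean distance matrix between all samples (assumed metric).
    D = squareform(pdist(X, metric="euclidean"))
    # Sparse matrix holding only the MST edges.
    mst = minimum_spanning_tree(D)
    # Treat the tree as undirected and sum edge weights along each tree path.
    return shortest_path(mst, method="D", directed=False)


# Three points forming an L shape: the MST keeps the two unit-length edges.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
D_mc = mc_distances(X)
```

In this toy example the two end points are about 1.41 apart in Euclidean terms, but the path over the tree has length 2, which is the kind of curvilinear distance the MC principle relies on.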
### Formula representation:

- **MC distance matrix**: \[ D_{MC} \]
- **Sparse MC similarity kernel**: \[ f(x)=\max[0,(1 - x - t)] \] where \( x \) is the original MC distance and \( t \) is the automatically detected threshold.
- **Converting Euclidean distance to similarity**: \[ f(x)=\max\left[0,\left(1-\frac{x}{\max(x)}-t\right)\right] \]

Through these improvements, MC-MCL can provide more accurate and effective clustering results when dealing with nonlinear data.
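The thresholded similarity and the MCL step can be sketched together as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names `sparse_similarity` and `mcl`, the inflation value of 2, the stopping rule, and the hand-picked threshold `t=0.5` are all hypothetical (the summary states that \( t \) is detected automatically, which is not reproduced here). The similarity uses the normalized form \( \max[0, 1 - x/\max(x) - t] \).

```python
import numpy as np


def sparse_similarity(D, t):
    """f(x) = max(0, 1 - x / max(x) - t), applied elementwise to distances D."""
    return np.maximum(0.0, 1.0 - D / D.max() - t)


def mcl(S, inflation=2.0, n_iter=100, tol=1e-8):
    """Plain MCL sketch: alternate expansion and inflation until stable."""
    # Column-normalize the similarities into a stochastic matrix.
    M = S / S.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        prev = M
        M = M @ M                              # expansion: two-step random walk
        M = M ** inflation                     # inflation: boost strong flows
        M = M / M.sum(axis=0, keepdims=True)   # renormalize columns
        if np.abs(M - prev).max() < tol:
            break
    # Label each sample by the row (attractor) holding most of its column mass.
    return M.argmax(axis=0)


# Toy data: two well-separated groups of points on a line.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(pts[:, None] - pts[None, :])   # pairwise distances
S = sparse_similarity(D, t=0.5)           # t chosen by hand (assumption)
labels = mcl(S)
```

With this threshold, similarities between the two groups are clipped to zero, so the flow stays inside each group and the two groups should receive different labels. The zero distances on the diagonal become positive self-similarities, which also keeps every column sum nonzero.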