CBMAP: Clustering-based manifold approximation and projection for dimensionality reduction

Berat Dogan
2024-09-16
Abstract: Dimensionality reduction methods are employed to decrease data dimensionality, either to enhance machine learning performance or to facilitate data visualization in two- or three-dimensional spaces. These methods typically fall into two categories: feature selection and feature transformation. Feature selection retains significant features, while feature transformation projects data into a lower-dimensional space using either linear or nonlinear methods. While nonlinear methods excel at preserving local structures and capturing nonlinear relationships, they may struggle to represent global structures and can be computationally intensive. Recent algorithms such as t-SNE, UMAP, TriMap, and PaCMAP prioritize preserving local structures, often at the expense of accurately representing global structures, which leads to clusters being spread out more in lower-dimensional spaces. Moreover, these methods rely heavily on hyperparameters, making their results sensitive to parameter settings. To address these limitations, this study introduces a clustering-based approach, namely CBMAP (Clustering-Based Manifold Approximation and Projection), for dimensionality reduction. CBMAP aims to preserve both global and local structures, ensuring that clusters in lower-dimensional spaces closely resemble those in high-dimensional spaces. Experimental evaluations on benchmark datasets demonstrate CBMAP's efficacy, offering speed, scalability, and minimal reliance on hyperparameters. Importantly, CBMAP enables low-dimensional projection of test data, addressing a critical need in machine learning applications. CBMAP is freely available at https://github.com/doganlab/cbmap and can be installed from the Python Package Index (PyPI) with the command pip install cbmap.
Machine Learning
What problem does this paper attempt to address?
This paper addresses two shortcomings of existing nonlinear dimensionality reduction methods: they preserve the global structure of data poorly, and they depend heavily on hyperparameters. Popular methods such as t-SNE, UMAP, TriMap, and PaCMAP perform well at preserving local structure and capturing nonlinear relationships, but often do so at the cost of global structure, so that natural clusters may be broken up or dispersed in the low-dimensional space. Their results are also very sensitive to hyperparameter settings, which makes them unstable and hard to predict. To overcome these limitations, the paper introduces a clustering-based dimensionality reduction method, CBMAP (Clustering-Based Manifold Approximation and Projection). CBMAP aims to preserve global and local structures simultaneously, so that the clusters in the low-dimensional space closely resemble those in the high-dimensional space. Experimental evaluations show that CBMAP is fast and scalable, depends less on hyperparameters, and can project test data into the low-dimensional space, which meets a key requirement of machine learning applications.
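To make the idea of projecting test data concrete, here is a minimal Python sketch of the general fit-then-project pattern. It is not CBMAP and does not use the cbmap package, whose API is not described here. It uses plain PCA computed with NumPy as a stand-in, and the function names, the synthetic data, and the nearest-centroid check are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_projection(X_train, n_components=2):
    """Learn a PCA projection from training data only (stand-in, not CBMAP)."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(X, mean, components):
    """Map any points (training or unseen test) with the fitted projection."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)

# Hypothetical data: three well-separated Gaussian clusters in 10 dimensions.
centers = rng.normal(scale=10.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(100, 10)) for c in centers])
labels = np.repeat(np.arange(3), 100)

# Hold out 60 points as "test" data that the projection never sees.
idx = rng.permutation(len(X))
train, test = idx[:240], idx[240:]

mean, components = fit_linear_projection(X[train])
Z_train = project(X[train], mean, components)
Z_test = project(X[test], mean, components)  # no refitting needed

# Check that test points land near the low-dimensional centroid of their
# own cluster, i.e. the cluster structure carries over to unseen data.
centroids = np.stack(
    [Z_train[labels[train] == k].mean(axis=0) for k in range(3)]
)
dists = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=2)
accuracy = (dists.argmin(axis=1) == labels[test]).mean()
print(f"test points assigned to their own cluster: {accuracy:.2%}")
```

The point of the pattern is that the mapping is learned once and then reused, so new samples can be embedded without recomputing the whole layout. According to the abstract, CBMAP provides this capability for a nonlinear, clustering-based projection.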