Abstract: We propose a pair of completely data-driven algorithms for unsupervised classification and dimension reduction, and we empirically study their performance on a number of data sets, both simulated data in three dimensions and images from the COIL-20 data set. The algorithms take as input a set of points sampled from a uniform distribution supported on a metric space, the latter embedded in an ambient metric space, and they output a clustering or reduction of dimension of the data. They work by constructing a natural family of graphs from the data and selecting the graph which maximizes the relative von Neumann entropy of certain normalized heat operators constructed from the graphs. Once the appropriate graph is selected, the eigenvectors of the graph Laplacian may be used to reduce the dimension of the data, and clusters in the data may be identified with the kernel of the associated graph Laplacian. Notably, these algorithms do not require information about the size of a neighborhood or the desired number of clusters as input, in contrast to popular algorithms such as $k$-means, and even more modern spectral methods such as Laplacian eigenmaps, among others. In our computational experiments, our clustering algorithm outperforms $k$-means clustering on data sets with non-trivial geometry and topology, in particular data whose clusters are not concentrated around a specific point, and our dimension reduction algorithm is shown to work well in several simple examples.
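
For orientation, the standard textbook definitions behind the abstract's terminology can be written as below. The symbols $L_r$ (Laplacian of the graph at scale $r$), $t$ (heat time), $\rho$ and $\sigma$ are notation assumed here for illustration; the abstract does not state which pair of operators the paper actually compares.

```latex
% Normalized heat operator (a density matrix) of the graph Laplacian L_r at time t
\rho_{r,t} = \frac{e^{-t L_r}}{\operatorname{Tr} e^{-t L_r}}

% Relative von Neumann entropy of two density matrices \rho and \sigma
S(\rho \,\|\, \sigma) = \operatorname{Tr}\bigl[\rho\,(\log \rho - \log \sigma)\bigr] \;\ge\; 0

% The kernel of L_r counts connected components, which serve as clusters
\dim \ker L_r = \#\{\text{connected components of the graph at scale } r\}
```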

What problem does this paper attempt to address?

The paper addresses two central challenges in unsupervised learning: clustering and dimension reduction.

1. **Unsupervised classification (clustering)**: How to divide data points into clusters according to their intrinsic geometric structure, without label information. Traditional methods such as k-means rely on user-supplied parameters (such as the number of clusters), so they are not fully data-driven.
2. **Dimension reduction**: How to reduce the dimension of the data while preserving its local structure. Many existing dimension-reduction algorithms also require the user to specify parameters, such as the neighborhood size or the number of features.

### Main contributions of the paper

To solve these problems, the authors propose a method that uses the relative von Neumann entropy to select the optimal graph model, thereby achieving fully data-driven clustering and dimension reduction. The advantages of this method are:

- **No manual parameter setting**: It does not require the user to provide the neighborhood size or the expected number of clusters.
- **Adapts to data with complex geometric structure**: It is designed for data sets with non-trivial geometry and topology, in particular data whose clusters are not concentrated around a specific point.

### Method overview

1. **Construct a weighted graph**: For each scale $r$, construct a weighted graph $G_r$ whose vertices are the data points and whose edge weights are determined by the distances between points.
2. **Calculate the relative von Neumann entropy**: Compute the relative von Neumann entropy between normalized heat operators at different time steps, and select the graph that maximizes this entropy.
3. **Clustering and dimension reduction**:
   - **Clustering**: Use the kernel (null space) of the Laplacian matrix of the selected graph to identify clusters.
   - **Dimension reduction**: Use the eigenvectors of the Laplacian matrix of the selected graph to construct a map from the high-dimensional space to a low-dimensional one.

### Experimental results

The authors evaluate the method on synthetic data and on the COIL-20 image database. In the reported experiments, the clustering algorithm outperforms k-means on data with non-trivial geometry and topology, and the dimension reduction algorithm works well on several simple examples.

In summary, the paper aims to provide a fully data-driven, unsupervised method for clustering and dimension reduction that is suitable for complex data structures, using the relative von Neumann entropy as a model-selection criterion.
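
The three steps in the method overview can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the truncated Gaussian edge weights, the function names, the tolerance `tol`, and in particular the selection rule in `select_scale` (comparing heat states at two times `s` and `t` on the same graph) are assumptions made for illustration, since the summary does not specify them. The relative entropy itself follows the standard definition $S(\rho\|\sigma) = \operatorname{Tr}[\rho(\log\rho - \log\sigma)]$.

```python
import numpy as np


def graph_laplacian(points, r):
    """Unnormalized Laplacian L = D - W of a distance-weighted graph at scale r."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Assumed weighting: Gaussian in distance, truncated to pairs closer than r.
    weights = np.where(dist < r, np.exp(-((dist / r) ** 2)), 0.0)
    np.fill_diagonal(weights, 0.0)
    return np.diag(weights.sum(axis=1)) - weights


def heat_state(laplacian, t):
    """Return rho = exp(-t L) / Z and log Z, where Z = tr exp(-t L)."""
    lam, vec = np.linalg.eigh(laplacian)
    w = np.exp(-t * lam)
    z = w.sum()
    rho = (vec * (w / z)) @ vec.T
    return rho, np.log(z)


def relative_entropy(lap_a, s, lap_b, t):
    """S(rho || sigma) = tr rho (log rho - log sigma) for two heat states.

    Uses log rho = -s L_a - log Z_a and log sigma = -t L_b - log Z_b,
    so no numerical matrix logarithm is needed.
    """
    rho, log_za = heat_state(lap_a, s)
    _, log_zb = heat_state(lap_b, t)
    return float(-s * np.trace(rho @ lap_a) - log_za
                 + t * np.trace(rho @ lap_b) + log_zb)


def select_scale(points, scales, s=1.0, t=2.0):
    """Hypothetical criterion: pick the scale maximizing S(rho_s || rho_t)."""
    scores = []
    for r in scales:
        lap = graph_laplacian(points, r)
        scores.append(relative_entropy(lap, s, lap, t))
    return scales[int(np.argmax(scores))]


def count_clusters(laplacian, tol=1e-8):
    """Kernel dimension of the Laplacian = number of connected components."""
    return int(np.sum(np.linalg.eigvalsh(laplacian) < tol))


def embed(laplacian, dim, tol=1e-8):
    """Coordinates from eigenvectors of the `dim` smallest nonzero eigenvalues."""
    lam, vec = np.linalg.eigh(laplacian)
    return vec[:, lam > tol][:, :dim]


# Toy data: two well-separated blobs in the plane.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=(10.0, 0.0), scale=0.1, size=(20, 2))
points = np.vstack([blob_a, blob_b])

lap = graph_laplacian(points, r=2.0)
n_clusters = count_clusters(lap)
```

On this toy data the graph at scale `r=2.0` has no edges between the two blobs, so the Laplacian kernel is two-dimensional and `count_clusters` recovers both groups without being told how many to expect. Calling `select_scale(points, [0.2, 2.0, 20.0])` shows how a scale could be chosen automatically, though which scale wins depends on the assumed criterion.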

Noncommutative Model Selection for Data Clustering and Dimension Reduction Using Relative von Neumann Entropy

Adaptive Dimension Reduction for Clustering High Dimensional Data

A new model for natural groupings in high-dimensional data

Dimension reduction and the gradient flow of relative entropy

Randomized Dimensionality Reduction for k-means Clustering

Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions

Local Nonlinear Dimensionality Reduction via Preserving the Geometric Structure of Data

Dimensionality-reduced subspace clustering

PCA-KL: a parametric dimensionality reduction approach for unsupervised metric learning

A Local Similarity-Preserving Framework for Nonlinear Dimensionality Reduction with Neural Networks

Uniqueness of Non-Gaussianity-Based Dimension Reduction

Dimension reduction for covariates in network data

Sufficient Dimensionality Reduction with Irrelevant Statistics

High-dimensional logistic entropy clustering

On Probabilistic Embeddings in Optimal Dimension Reduction

Noncommutative Model Selection and the Data-Driven Estimation of Real Cohomology Groups

Automatic Parameter Selection for Non-Redundant Clustering

Measures of Entropy From Data Using Infinitely Divisible Kernels

The Exploitation of Distance Distributions for Clustering

A Dimensionality Reduction and Reconstruction Method for Data with Multiple Connected Components

Non-Redundant Subspace Clusterings with Nr-Kmeans and Nr-DipMeans