Abstract: Dimensionality reduction tools like t-SNE and UMAP are widely used for high-dimensional data analysis. For instance, these tools are applied in biology to describe spiking patterns of neuronal populations or the genetic profiles of different cell types. Here, we show that when data include noise points that are randomly scattered within a high-dimensional space, a "scattering noise problem" occurs in the low-dimensional embedding where noise points overlap with the cluster points. We show that a simple transformation of the original distance matrix by computing a distance between neighbor distances alleviates this problem and identifies the noise points as a separate cluster. We apply this technique to high-dimensional neuronal spike sequences, as well as the representations of natural images by convolutional neural network units, and find an improvement in the constructed low-dimensional embedding. Thus, we present an improved dimensionality reduction technique for high-dimensional data containing noise points.

Biological datasets are often high-dimensional, e.g. the genetic profile of cells or the firing pattern of neural populations. Dimensionality reduction methods like t-SNE are commonly used to represent the high-dimensional data in a low-dimensional embedding space. The visualization helps us to identify the underlying clustering patterns and shed light on the information hidden within the data. We show that in situations where there exist scattering noise points, clustering patterns in the data tend to be heavily distorted. Here, we show that using a distance-of-distance (DoD) transformation of the dissimilarity matrix between data points, the influence of scattering noise is effectively removed. This neighborhood-based transformation is most effective when the dimensionality of the dataset is high. We show that this technique improves low-dimensional embedding for several high-dimensional datasets, such as the convolutional neural network representation of natural images or the neuronal population representation of visual stimuli.
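The idea of "computing a distance between neighbor distances" can be illustrated with a minimal sketch. The text above does not spell out the exact definition, so the details here are assumptions: each point is described by its sorted distances to its `k` nearest neighbors, and two points are compared by the Euclidean distance between those vectors. The function names `pairwise_euclidean` and `distance_of_distance` and the parameter `k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def pairwise_euclidean(X):
    """Return the (n, n) Euclidean distance matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)  # clip small negative values from round-off
    return np.sqrt(d2)


def distance_of_distance(D, k=10):
    """Assumed DoD variant: compare points by their neighbor-distance profiles.

    D is an (n, n) dissimilarity matrix. For every point we keep the sorted
    distances to its k nearest neighbors (self excluded) and return the
    Euclidean distance between these k-dimensional profiles.
    """
    off = np.array(D, dtype=float, copy=True)
    np.fill_diagonal(off, np.inf)            # ignore the self-distance
    profiles = np.sort(off, axis=1)[:, :k]   # (n, k) sorted neighbor distances
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))
```

Under these assumptions, scattered noise points all have large neighbor distances and therefore similar profiles, so they end up close to one another in the transformed matrix, while cluster points have small neighbor distances and stay apart from the noise. The resulting matrix could then be passed to an embedding method that accepts precomputed dissimilarities.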

Learning the Distribution of Data for Embedding

Locality Pursuit Embedding

Discriminant Neighborhood Embedding for Classification.

Supervised locality pursuit embedding for pattern classification

Optimal Dimensionality of Metric Space for Classification

Learning to Embed Distributions via Maximum Kernel Entropy

Neighbourhood Sensitive Preserving Embedding for Pattern Classification.

Deep Recursive Embedding for High-Dimensional Data

Metric Distribution to Vector: Constructing Data Representation via Broad-Scale Discrepancies

Graph Embedding with Constraints

Learning in High-Dimensional Multimedia Data: the State of the Art

Discriminative sparse embedding based on adaptive graph for dimension reduction

Dimensionality Reduction By Using Sparse Reconstruction Embedding

Dimensionality Reduction by T-Distribution Adaptive Manifold Embedding.

Global structure-guided neighborhood preserving embedding for dimensionality reduction

Feature Space Distance Metric Learning for Discriminant Graph Embedding

Maximal Linear Embedding for Dimensionality Reduction

Signed Laplacian Embedding for Supervised Dimension Reduction.

Improved visualization of high-dimensional data using the distance-of-distance transformation

Constrained discriminant neighborhood embedding for high dimensional data feature extraction.

A New Approach to Discover Interlacing Data Structures in High-Dimensional Space