Abstract: Visualizing high-dimensional data is an important routine for understanding biomedical data and interpreting deep learning models. Neighbor embedding methods, such as t-SNE, UMAP, and LargeVis, are a family of popular visualization methods that reduce high-dimensional data to two dimensions. However, recent studies suggest that these methods often produce visual artifacts, potentially leading to incorrect scientific conclusions. Recognizing that the current limitation stems from a lack of data-independent notions of embedding maps, we introduce a novel conceptual and computational framework, LOO-map, that learns the embedding maps based on a classical statistical idea known as leave-one-out. LOO-map extends the embedding over a discrete set of input points to the entire input space, enabling a systematic assessment of map continuity, and thus of the reliability of the visualizations. We find that, for many neighbor embedding methods, the embedding maps can be intrinsically discontinuous. The discontinuity induces two types of observed map distortion: "overconfidence-inducing discontinuity," which exaggerates cluster separation, and "fracture-inducing discontinuity," which creates spurious local structures. Building upon LOO-map, we propose two diagnostic point-wise scores, the perturbation score and the singularity score, to address these limitations. These scores can help identify unreliable embedding points, detect out-of-distribution data, and guide hyperparameter selection. Our approach is flexible and works as a wrapper around many neighbor embedding algorithms. We test our methods across multiple real-world datasets from computer vision and single-cell omics to demonstrate their effectiveness in enhancing the interpretability and accuracy of visualizations.
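The leave-one-out idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simplified t-SNE-style loss with a fixed Gaussian bandwidth, and the function name `loo_embed` and its parameters (`sigma`, `lr`, `n_iter`) are hypothetical. The existing embedding points are frozen, and only the position of one new input point is optimized, which yields a map defined on the whole input space rather than only on the training points.

```python
import numpy as np


def loo_embed(x_new, X, Y, sigma=1.0, lr=0.2, n_iter=500):
    """Place one new input point in a frozen 2-D embedding (illustrative only).

    X: (n, d) existing inputs; Y: (n, 2) their fixed embedding points.
    Assumption: fixed-bandwidth Gaussian affinities in input space and a
    Student-t kernel in the embedding, as in a simplified t-SNE.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Input-space affinities of the new point to every existing point.
    d2_in = np.sum((X - x_new) ** 2, axis=1)
    p = np.exp(-(d2_in - d2_in.min()) / (2.0 * sigma ** 2))
    p /= p.sum()

    # Start from the affinity-weighted average of the frozen embedding points.
    y = p @ Y

    for _ in range(n_iter):
        diff = y - Y                                  # shape (n, 2)
        w = 1.0 / (1.0 + np.sum(diff ** 2, axis=1))   # Student-t kernel
        q = w / w.sum()                               # embedding affinities
        grad = 2.0 * ((p - q) * w) @ diff             # d KL(p || q) / d y
        y = y - lr * grad                             # plain gradient step
    return y
```

Under these assumptions, evaluating `loo_embed` at an input `x` and at a slightly perturbed input gives a rough probe of map continuity: a large jump in the output for a small change in the input suggests a discontinuity of the kind the abstract describes.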

Interpretable Embedding and Visualization of Compressed Data

An Interpretable Data Embedding under Uncertain Distance Information

Deep Manifold Computing and Visualization

FIDE: Fast and Interpretable 2D Embedding with Correlation, Distance, and Rank Considerations

Deep Manifold Computing and Visualization Using Elastic Locally Isometric Smoothness

Compressive mining: fast and optimal data mining in the compressed domain

A Unified Framework for Jointly Compressing Visual and Semantic Data

Revisit Visual Representation in Analytics Taxonomy: A Compression Perspective

An Eigenshapes Approach to Compressed Signed Distance Fields and Their Utility in Robot Mapping

Interpretable Learned Image Compression: A Frequency Transform Decomposition Perspective

A Computational Approach to Interpreting the Embedding Space of Dimension Reduction

Optimal Distance Estimation Between Compressed Data Series

Statistical embedding: Beyond principal components

A General Framework for Comparing Embedding Visualizations Across Class-Label Hierarchies

See without looking: joint visualization of sensitive multi-site datasets

Compressive spectral embedding: sidestepping the SVD

LEt-SNE: A Hybrid Approach To Data Embedding and Visualization of Hyperspectral Imagery

Spectral/spatial Hyperspectral Image Compression in Conjunction with Virtual Dimensionality

Assessing and improving reliability of neighbor embedding methods: a map-continuity perspective

Efficient Neural Representation of Volumetric Data using Coordinate-Based Networks

An Interpretable Compression and Classification System: Theory and Applications