Abstract: Dimension reduction (or embedding) is a popular way to visualize data and a fundamental technique in many applications. Non-linear dimension reduction methods such as t-SNE and UMAP are widely used for visualizing single-cell RNA sequencing data and for metagenomic binning, and have therefore received considerable attention in bioinformatics and computational biology. In this paper, we improve UMAP-like non-linear dimension reduction algorithms by replacing the graph-based nearest neighbor search algorithm (we use the Hierarchical Navigable Small World graph, or HNSW, instead of K-graph) and by combining several aspects of t-SNE and UMAP into a new non-linear dimension reduction algorithm. We also provide several additional features, including computation of LID (Local Intrinsic Dimension) and hubness. These quantities reflect structures and properties of the underlying data that strongly affect the nearest neighbor search step in traditional UMAP-like algorithms, and thus the quality of the embeddings. We further combine the improved algorithm with probabilistic data structures such as MinHash-like sketches (e.g., ProbMinHash and others) for large-scale visualization of biological sequence data. Our library is called annembed; it is implemented and fully parallelized in Rust. We benchmark it against the popular tools mentioned above on standard test datasets, where it shows competitive accuracy. Additionally, we apply the library to three real-world problems: visualizing a large-scale microbial genomic database, visualizing single-cell RNA sequencing data, and metagenomic binning. These applications showcase the performance, scalability and efficiency of the library when distance computation is expensive or when the number of data points is large (e.g., millions or billions). Annembed can be found here:
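To make the two diagnostics named in the abstract concrete, the following is a minimal Python sketch of how LID and hubness can be estimated from a k-nearest-neighbor graph. It is not annembed's API or implementation: the function names (`knn_brute_force`, `lid_mle`, `k_occurrence`, `skewness`) are hypothetical, and the exact brute-force search stands in for an approximate index such as HNSW. The sketch assumes the common maximum-likelihood LID estimator and takes hubness to be the skewness of the k-occurrence counts, which may differ from the definitions used in the library.

```python
import math
import random


def knn_brute_force(points, k):
    """Exact k-NN by exhaustive search (a stand-in for an ANN index such as HNSW).

    Returns, for each point, a sorted list of (distance, neighbor_index) pairs.
    """
    result = []
    for i, p in enumerate(points):
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        result.append(dists[:k])
    return result


def lid_mle(neighbor_dists):
    """Maximum-likelihood LID estimate from sorted distances to the k nearest neighbors.

    LID = -k / sum_i ln(r_i / r_k), where r_k is the largest of the k distances.
    Zero distances (duplicate points) are skipped.
    """
    r_k = neighbor_dists[-1]
    logs = [math.log(r / r_k) for r in neighbor_dists if r > 0.0]
    total = sum(logs)
    if total == 0.0:
        # All neighbors are equidistant: the estimator is undefined.
        return float("inf")
    return -len(logs) / total


def k_occurrence(knn, n):
    """Count how often each point appears in the k-NN lists of other points."""
    counts = [0] * n
    for neighbors in knn:
        for _, j in neighbors:
            counts[j] += 1
    return counts


def skewness(values):
    """Population skewness; large positive values indicate a few strong hubs."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


if __name__ == "__main__":
    # Toy data: a 2-D plane embedded in 10-D, so the intrinsic dimension is 2.
    rng = random.Random(0)
    data = [(rng.random(), rng.random()) + (0.0,) * 8 for _ in range(300)]
    knn = knn_brute_force(data, k=20)
    lids = [lid_mle([d for d, _ in nb]) for nb in knn]
    print("mean LID:", sum(lids) / len(lids))
    print("hubness (skewness of k-occurrence):", skewness(k_occurrence(knn, len(data))))
```

On data like the toy example, the mean LID should land near the intrinsic dimension (2) rather than the ambient dimension (10). Strongly skewed k-occurrence counts would indicate that a few hub points dominate the neighbor graph, which is the situation the abstract describes as affecting embedding quality.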
