Abstract: Single cell RNA-seq (scRNA-seq) technology has transformed our understanding of biology at the cell level. Inferring cell types provides insights into the relative abundance of, and genomic differences between, different cell types. Current methods leverage known cell type markers and genomic similarity measures to attribute cell types to groups of cells. We present a scalable machine learning-based pipeline that leverages high-quality reference annotation data to infer cell types quickly for multiple scRNA-seq experimental samples. The pipeline combines two machine learning technologies: variational autoencoders (VAE), which create a low-dimensional representation of the reference transcriptome, and supervised learning methods, which use this representation to learn cell types from the reference data. The pipeline can quickly score new experimental data on GPU-enabled computing platforms and reports, for each cell, the probability of it being each cell type present in the reference data. We evaluate this pipeline using 5 public reference data sets. We first evaluate whether VAE-based representations benefit supervised learning compared to PCA and t-SNE. We next compare the predictive performance of different supervised learning methods. Finally, we compare the pipeline’s performance with commonly used cell type prediction algorithms (Seurat, scANVI). We find that a VAE is generally better than both PCA and t-SNE for generating features to predict cell type. Using the VAE representation, there is no significant difference in accuracy between logistic regression and other supervised learning algorithms. Finally, with logistic regression as the classifier, this pipeline is as accurate as Seurat and better than scANVI, while running about 2-5x faster than Seurat.

Table 1. Balanced accuracy (%) from leave-one-subject-out cross-validation across 5 public data sets. Intervals are the range of the metric across the cross-validation folds.

Algorithm      PBMC (10X)   HLCA         Eraslan snRNA   Blueprint Breast   Blueprint Lung
Our pipeline   92 (87-95)   92 (88-93)   93 (90-97)      89 (80-96)         97 (95-99)
Seurat         96 (95-97)   90 (94-98)   94 (93-97)      96 (87-98)         98 (94-99)
scANVI         91 (84-95)   72 (78-86)   92 (88-96)      94 (80-97)         88 (86-100)

Citation Format: Abhijit Dasgupta, Grant Duclos, Ricardo Miragaia, Brychan Manry, Etai Jacob, Natasha Markuzon, Asaf Rotem. A scalable single cell RNA-seq pipeline leveraging machine learning and high-quality references for cell-type prediction [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3520.
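
The evaluation protocol described for Table 1 (leave-one-subject-out cross-validation, scored by balanced accuracy and summarized as a range across folds) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-centroid classifier is a simple stand-in for the VAE representation plus logistic regression, and the toy embeddings, cell type labels, subject IDs, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare cell types weigh as much as common ones."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return mean(hits[c] / totals[c] for c in totals)


def fit_centroids(X, y):
    """Stand-in classifier (an assumption, not the abstract's method):
    one mean embedding vector per cell type."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] += 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(centroids, X):
    """Assign each cell to the cell type whose centroid is nearest."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in X]


def leave_one_subject_out(X, y, subjects):
    """Hold out all cells from one subject per fold; return balanced accuracy per fold."""
    scores = {}
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        y_pred = predict(model, [X[i] for i in test])
        scores[held_out] = balanced_accuracy([y[i] for i in test], y_pred)
    return scores


# Toy 2-D embeddings: hypothetical values standing in for a learned latent space.
X = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1],
     [0.1, 0.2], [0.9, 1.0], [1.1, 0.9],
     [0.0, 0.0], [1.0, 1.0], [0.6, 0.6]]
y = ["T cell", "T cell", "B cell",
     "T cell", "B cell", "B cell",
     "T cell", "B cell", "T cell"]
subjects = ["s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3"]

fold_scores = leave_one_subject_out(X, y, subjects)
values = list(fold_scores.values())
summary = (mean(values), min(values), max(values))  # mean, then range across folds
print(fold_scores, summary)
```

Splitting by subject rather than by cell keeps every cell from a held-out donor out of training, so each fold measures generalization to an unseen individual. In this toy data the fold that holds out `s3` scores 0.75 because the T cell at (0.6, 0.6) lies closer to the B cell centroid, which gives a range of 0.75 to 1.0 across folds.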

Abstract 3520: A scalable single cell RNA-seq pipeline leveraging machine learning and high-quality references for cell-type prediction
