Abstract: Single-cell RNA-seq (scRNA-seq) technology has transformed our understanding of biology at the cell level. Inferring cell types provides insights into the relative abundance of, and genomic differences between, different cell types. Current methods leverage known cell type markers and genomic similarity measures to attribute cell types to groups of cells. We present a scalable machine learning-based pipeline that leverages high-quality reference annotation data to infer cell types quickly for multiple scRNA-seq experimental samples. The pipeline combines two machine learning technologies: variational autoencoders (VAE), which create a low-dimensional representation of the reference transcriptome, and supervised learning methods, which use this representation to learn cell types from the reference data. The pipeline can quickly score new experimental data on GPU-enabled computing platforms and provides, for each cell, the probability of it belonging to each cell type present in the reference data. We evaluate this novel pipeline using 5 public reference data sets. First, we evaluate whether VAE-based representations benefit supervised learning, compared to PCA and t-SNE. Next, we evaluate different supervised learning methods for predictive performance. Finally, we compare the pipeline's performance with commonly used cell type prediction algorithms (Seurat, scANVI). We find that a VAE is generally better than both PCA and t-SNE for generating features to predict cell type. Using the VAE representation, there is no significant difference in accuracy between logistic regression and other supervised learning algorithms. Finally, with logistic regression as the classifier, the pipeline is as accurate as Seurat and more accurate than scANVI, while running about 2-5x faster than Seurat.

Table 1. Balanced accuracy (%) from leave-one-subject-out cross-validation across 5 public data sets. Intervals are the range of the metric across the cross-validation folds.

| Algorithm | PBMC (10X) | HLCA | Eraslan snRNA | Blueprint Breast | Blueprint Lung |
|---|---|---|---|---|---|
| Our pipeline | 92 (87-95) | 92 (88-93) | 93 (90-97) | 89 (80-96) | 97 (95-99) |
| Seurat | 96 (95-97) | 90 (94-98) | 94 (93-97) | 96 (87-98) | 98 (94-99) |
| scANVI | 91 (84-95) | 72 (78-86) | 92 (88-96) | 94 (80-97) | 88 (86-100) |

Citation Format: Abhijit Dasgupta, Grant Duclos, Ricardo Miragaia, Brychan Manry, Etai Jacob, Natasha Markuzon, Asaf Rotem. A scalable single cell RNA-seq pipeline leveraging machine learning and high-quality references for cell-type prediction [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3520.
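
The evaluation behind Table 1 (balanced accuracy under leave-one-subject-out cross-validation, summarized with the range across folds) can be sketched in a few lines. This is a minimal illustration, not the authors' code. The nearest-centroid classifier is a stand-in for the VAE plus logistic regression pipeline, the toy embeddings, labels, and subject IDs are invented, and taking the mean across folds as the central value is an assumption, since the abstract only states that the intervals are ranges.

```python
from collections import defaultdict
from statistics import mean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare cell types weigh as much as common ones."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return mean(hits[c] / totals[c] for c in totals)


def leave_one_subject_out(subjects):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject."""
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        yield held_out, train, test


def fit_centroids(X, y):
    """Stand-in classifier (NOT the abstract's model): one mean vector per class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [mean(col) for col in zip(*rows)] for label, rows in groups.items()}


def predict(centroids, X):
    """Assign each row to the class with the closest centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda c: sq_dist(row, centroids[c])) for row in X]


def summarize(X, y, subjects):
    """Return (mean, min, max) of balanced accuracy across subject-wise folds."""
    scores = []
    for _, train, test in leave_one_subject_out(subjects):
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in test])
        scores.append(balanced_accuracy([y[i] for i in test], preds))
    return mean(scores), min(scores), max(scores)


# Hypothetical 2-D embeddings for cells from three subjects and two cell types.
X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1],
     [0.1, 0.3], [5.2, 4.9], [4.8, 5.0],
     [0.3, 0.2], [5.1, 5.3], [4.9, 4.7]]
y = ["T", "T", "B", "T", "B", "B", "T", "B", "B"]
subjects = ["s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3"]

avg, lo, hi = summarize(X, y, subjects)
print(f"balanced accuracy: {100 * avg:.0f} ({100 * lo:.0f}-{100 * hi:.0f})")
```

Splitting by subject rather than by cell keeps all cells from one donor in the same fold, so the score reflects performance on an unseen individual. In practice, scikit-learn's `balanced_accuracy_score` and `LeaveOneGroupOut` provide the same metric and splitting scheme.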

V-SVA: an R Shiny application for detecting and annotating hidden sources of variation in single-cell RNA-seq data

scViewer: An Interactive Single-Cell Gene Expression Visualization Tool

De Novo Identification of Expressed Cancer Somatic Mutations from Single-Cell RNA Sequencing Data

scSNViz: a user-friendly toolset for visualization and analysis of Cell-Specific Expressed SNVs

scSVA: an interactive tool for big data visualization and exploration in single-cell omics

Scvi-Tools: a Library for Deep Probabilistic Analysis of Single-Cell Omics Data

SCI-VCF: A cross-platform GUI solution to Summarise, Compare, Inspect, and Visualise the Variant Call Format

SCHNAPPs - Single Cell sHiNy APPlication(s)

scX: a user-friendly tool for scRNAseq exploration

IVAG: An Integrative Visualization Application for Various Types of Genomic Data Based on R-Shiny and the Docker Platform

scX: A user-friendly tool for scRNA-seq exploration

scRNA-Explorer: An End-user Online Tool for Single Cell RNA-seq Data Analysis Featuring Gene Correlation and Data Filtering

Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R

scQCEA: a framework for annotation and quality control report of single-cell RNA-sequencing data

SingleCAnalyzer: Interactive Analysis of Single Cell RNA-Seq Data on the Cloud

SAVER: gene expression recovery for single-cell RNA sequencing

SCALA: A complete solution for multimodal analysis of single-cell Next Generation Sequencing data

RNfuzzyApp: an R shiny RNA-seq data analysis app for visualisation, differential expression analysis, time-series clustering and enrichment analysis

GSVA: gene set variation analysis for microarray and RNA-Seq data

Abstract 3520: A scalable single cell RNA-seq pipeline leveraging machine learning and high-quality references for cell-type prediction

scRNASequest: an ecosystem of scRNA-seq analysis, visualization, and publishing