Abstract: Single cell RNA-seq (scRNA-seq) technology has transformed our understanding of biology at the cellular level. Inferring cell types provides insights into the relative abundance of, and genomic differences between, different cell types. Current methods leverage known cell type markers and genomic similarity measures to attribute cell types to groups of cells. We present a scalable machine learning-based pipeline that leverages high-quality reference annotation data to infer cell types quickly for multiple scRNA-seq experimental samples. The pipeline combines two machine learning technologies: variational autoencoders (VAE), which create a low-dimensional representation of the reference transcriptome, and supervised learning methods, which use this representation to learn cell types from the reference data. The pipeline can quickly score new experimental data on GPU-enabled computing platforms and, for each cell, provides the probability of it being each cell type present in the reference data. We evaluate the pipeline on five public reference data sets. First, we evaluate whether VAE-based representations benefit supervised learning compared to PCA and t-SNE. Next, we compare the predictive performance of different supervised learning methods. Finally, we compare the pipeline's performance with commonly used cell type prediction algorithms (Seurat, scANVI). We find that a VAE is generally better than both PCA and t-SNE for generating features to predict cell type. Using the VAE representation, there is no significant difference in accuracy between logistic regression and other supervised learning algorithms. With logistic regression as the classifier, the pipeline is as accurate as Seurat and more accurate than scANVI, while running about 2-5x faster than Seurat.

Table 1. Balanced accuracy (%) from leave-one-subject-out cross-validation across 5 public data sets. Intervals are the range of the metric across the cross-validation folds.

| Algorithm | PBMC (10X) | HLCA | Eraslan snRNA | Blueprint Breast | Blueprint Lung |
| Our pipeline | 92 (87-95) | 92 (88-93) | 93 (90-97) | 89 (80-96) | 97 (95-99) |
| Seurat | 96 (95-97) | 90 (94-98) | 94 (93-97) | 96 (87-98) | 98 (94-99) |
| scANVI | 91 (84-95) | 72 (78-86) | 92 (88-96) | 94 (80-97) | 88 (86-100) |

Citation Format: Abhijit Dasgupta, Grant Duclos, Ricardo Miragaia, Brychan Manry, Etai Jacob, Natasha Markuzon, Asaf Rotem. A scalable single cell RNA-seq pipeline leveraging machine learning and high-quality references for cell-type prediction [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3520.
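The evaluation protocol behind Table 1 (balanced accuracy under leave-one-subject-out cross-validation, reported with its range across folds) can be illustrated with a minimal standard-library sketch. This is not the authors' code. The nearest-centroid classifier is a stand-in for the VAE plus logistic regression pipeline, the toy 2-D "embeddings", subject IDs, and all function names are hypothetical, and reporting the mean across folds as the point estimate is an assumption, since the abstract does not say how the headline number is computed.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def leave_one_subject_out(subjects):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject."""
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        yield held_out, train, test


def fit_centroids(X, y):
    """Stand-in classifier (assumption, not logistic regression): class means."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [a + b for a, b in zip(sums[label], x)]
        counts[label] += 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}


def predict(centroids, X):
    """Assign each point to the class with the nearest centroid."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda c: sqdist(centroids[c], x)) for x in X]


def evaluate(X, y, subjects):
    """Return (mean, min, max) balanced accuracy in percent across folds."""
    scores = []
    for _, train, test in leave_one_subject_out(subjects):
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        y_pred = predict(model, [X[i] for i in test])
        y_test = [y[i] for i in test]
        scores.append(100.0 * balanced_accuracy(y_test, y_pred))
    return sum(scores) / len(scores), min(scores), max(scores)


# Hypothetical toy data: 2-D embeddings for three subjects, two cell types.
X = [(0.1, 0.0), (0.0, 0.2), (5.0, 5.1), (4.9, 5.0),    # subject s1
     (0.2, 0.1), (-0.1, 0.0), (5.2, 4.8), (5.0, 5.0),   # subject s2
     (0.0, -0.1), (0.1, 0.1), (4.8, 5.1), (5.1, 4.9)]   # subject s3
y = ["T", "T", "B", "B"] * 3
subjects = ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4

mean_ba, low, high = evaluate(X, y, subjects)
print(f"{mean_ba:.0f} ({low:.0f}-{high:.0f})")
```

Holding out all cells from one subject at a time keeps cells from the same donor out of both the training and test sets, and balanced accuracy weights every cell type equally regardless of its abundance.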
