Abstract: Single-cell RNA-seq (scRNA-seq) technology has transformed our understanding of biology at the cell level. Inferring cell types provides insights into the relative abundance of, and genomic differences between, different cell types. Current methods leverage known cell type markers and genomic similarity measures to attribute cell types to groups of cells. We present a scalable machine learning-based pipeline that leverages high-quality reference annotation data to infer cell types quickly for multiple scRNA-seq experimental samples. The pipeline combines two machine learning technologies: variational autoencoders (VAE), which create a low-dimensional representation of the reference transcriptome, and supervised learning methods, which use this representation to learn cell types from the reference data. The pipeline can quickly score new experimental data on GPU-enabled computing platforms and provides, for each cell, the probability of it being each cell type present in the reference data. We evaluate this novel pipeline using 5 public reference data sets. We first evaluate whether VAE-based representations benefit supervised learning, compared to PCA and t-SNE. We next evaluate different supervised learning methods for predictive performance. Finally, we compare the pipeline's performance with commonly used cell type prediction algorithms (Seurat, scANVI). We find that a VAE is generally better than both PCA and t-SNE for generating features to predict cell type. Using the VAE representation, there is no significant difference in accuracy between logistic regression and other supervised learning algorithms. Finally, with logistic regression as the supervised learner, the pipeline is as accurate as Seurat and better than scANVI, while running about 2-5x faster than Seurat.

Table 1. Balanced accuracy (%) from leave-one-subject-out cross-validation across 5 public data sets. Intervals are the range of the metric across the cross-validation folds.

Algorithm      PBMC (10X)    HLCA          Eraslan snRNA   Blueprint Breast   Blueprint Lung
Our pipeline   92 (87-95)    92 (88-93)    93 (90-97)      89 (80-96)         97 (95-99)
Seurat         96 (95-97)    90 (94-98)    94 (93-97)      96 (87-98)         98 (94-99)
scANVI         91 (84-95)    72 (78-86)    92 (88-96)      94 (80-97)         88 (86-100)

Citation Format: Abhijit Dasgupta, Grant Duclos, Ricardo Miragaia, Brychan Manry, Etai Jacob, Natasha Markuzon, Asaf Rotem. A scalable single cell RNA-seq pipeline leveraging machine learning and high-quality references for cell-type prediction [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3520.
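
The evaluation protocol behind Table 1 (balanced accuracy under leave-one-subject-out cross-validation, summarised across folds) can be illustrated with a minimal sketch. This is not the authors' code: the 2-D synthetic "embeddings" stand in for a learned low-dimensional representation, the nearest-centroid classifier is a hypothetical stand-in for the supervised learner, and all function names, subject IDs, and cell-type labels are assumptions made for illustration. The abstract states only that the intervals are ranges across folds; reporting the mean as the central value is also an assumption.

```python
import random
from collections import defaultdict
from statistics import mean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare cell types weigh as much as common ones."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return mean(hits[c] / totals[c] for c in totals)


def leave_one_subject_out(subjects):
    """Yield (held_out_subject, train_idx, test_idx), one fold per distinct subject."""
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        yield held_out, train, test


def fit_centroids(X, y):
    """Stand-in classifier (assumption): one mean vector per cell type."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [mean(col) for col in zip(*rows)] for label, rows in groups.items()}


def predict(centroids, X):
    """Assign each cell to the nearest centroid (squared Euclidean distance)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda c: dist(row, centroids[c])) for row in X]


# Synthetic 2-D "embeddings": 3 subjects, 2 cell types with well-separated means.
rng = random.Random(0)
centers = {"T cell": (0.0, 0.0), "B cell": (5.0, 5.0)}
X, y, subjects = [], [], []
for subject in ("S1", "S2", "S3"):
    for label, (cx, cy) in centers.items():
        for _ in range(20):
            X.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            y.append(label)
            subjects.append(subject)

# Train on all subjects but one, score the held-out subject, repeat per subject.
fold_scores = []
for held_out, train, test in leave_one_subject_out(subjects):
    model = fit_centroids([X[i] for i in train], [y[i] for i in train])
    preds = predict(model, [X[i] for i in test])
    fold_scores.append(balanced_accuracy([y[i] for i in test], preds))

# Summarise folds as a central value plus range (mean is an assumption here).
print(f"balanced accuracy: {mean(fold_scores):.2f} "
      f"(range {min(fold_scores):.2f}-{max(fold_scores):.2f})")
```

Holding out whole subjects, rather than random cells, keeps cells from the same donor out of both the training and test sets at once, which is what makes the fold-to-fold range informative about generalisation to new samples. In practice the same protocol is available in scikit-learn through `LeaveOneGroupOut` and `balanced_accuracy_score`.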

scTab: Scaling Cross-Tissue Single-Cell Annotation Models

scART: Recognizing Cell Clusters and Constructing Trajectory from Single-Cell Epigenomic Data

scGAT: A Cell-Type Annotation Framework for Single-Cell Transcriptomics Using Graph Attention Network and Meta Learning

scGen Predicts Single-Cell Perturbation Responses

TripletCell: a deep metric learning framework for accurate annotation of cell types at the single-cell level

Cell type matching across species using protein embeddings and transfer learning

scARE: Attribution Regularization for Single Cell Representation Learning

A sandbox for prediction and integration of DNA, RNA, and proteins in single cells

A self-training interpretable cell type annotation framework using specific marker gene

scATAcat: Cell-type annotation for scATAC-seq data

Comprehensive Integration of Single-Cell Data

Abstract 3520: A scalable single cell RNA-seq pipeline leveraging machine learning and high-quality references for cell-type prediction

EnClaSC: a novel ensemble approach for accurate and robust cell-type classification of single-cell transcriptomes

scATAnno: Automated Cell Type Annotation for single-cell ATAC Sequencing Data

Deciphering cell types by integrating scATAC-seq data with genome sequences

scGAA: a general gated axial-attention model for accurate cell-type annotation of single-cell RNA-seq data

Joint cell type identification in spatial transcriptomics and single-cell RNA sequencing data

CTEC: a cross-tabulation ensemble clustering approach for single-cell RNA sequencing data analysis

Contrastive Learning for Robust Cell Annotation and Representation from Single-Cell Transcriptomics

scSwinTNet: A Cell Type Annotation Method for Large-Scale Single-Cell RNA-Seq Data Based on Shifted Window Attention

Predicting cell types with supervised contrastive learning on cells and their types