Abstract 3520: A scalable single cell RNA-seq pipeline leveraging machine learning and high-quality references for cell-type prediction

Abhijit Dasgupta,Grant Duclos,Ricardo Miragaia,Brychan Manry,Etai Jacob,Natasha Markuzon,Asaf Rotem
DOI: https://doi.org/10.1158/1538-7445.am2024-3520
IF: 11.2
2024-03-22
Cancer Research
Abstract: Single cell RNA-seq (scRNA-seq) technology has transformed our understanding of biology at the cell level. Inferring cell types provides insight into the relative abundance of, and genomic differences between, different cell types. Current methods leverage known cell type markers and genomic similarity measures to attribute cell types to groups of cells. We present a scalable machine learning-based pipeline that leverages high-quality reference annotation data to infer cell types quickly for multiple scRNA-seq experimental samples. The pipeline combines two machine learning technologies: variational autoencoders (VAE), which create a low-dimensional representation of the reference transcriptome, and supervised learning methods, which use this representation to learn cell types from the reference data. The pipeline can quickly score new experimental data on GPU-enabled computing platforms and provides, for each cell, the probability of it being each cell type present in the reference data. We evaluate this novel pipeline using 5 public reference data sets. We first evaluate whether VAE-based representations benefit supervised learning compared to PCA and t-SNE. We next evaluate different supervised learning methods for predictive performance. Finally, we compare the pipeline’s performance with commonly used cell type prediction algorithms (Seurat, scANVI). We find that a VAE is generally better than both PCA and t-SNE for generating features to predict cell type. Using the VAE representation, there is no significant difference in accuracy between logistic regression and other supervised learning algorithms. Finally, using logistic regression as the learner, the pipeline is as accurate as Seurat and better than scANVI, while running about 2-5x faster than Seurat.

Table 1. Balanced accuracy (%) from leave-one-subject-out cross-validation across 5 public data sets. Intervals are the range of the metric across the cross-validation folds.
| Algorithm | PBMC (10X) | HLCA | Eraslan snRNA | Blueprint Breast | Blueprint Lung |
| --- | --- | --- | --- | --- | --- |
| Our pipeline | 92 (87-95) | 92 (88-93) | 93 (90-97) | 89 (80-96) | 97 (95-99) |
| Seurat | 96 (95-97) | 90 (94-98) | 94 (93-97) | 96 (87-98) | 98 (94-99) |
| scANVI | 91 (84-95) | 72 (78-86) | 92 (88-96) | 94 (80-97) | 88 (86-100) |

Citation Format: Abhijit Dasgupta, Grant Duclos, Ricardo Miragaia, Brychan Manry, Etai Jacob, Natasha Markuzon, Asaf Rotem. A scalable single cell RNA-seq pipeline leveraging machine learning and high-quality references for cell-type prediction [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3520.
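The abstract reports balanced accuracy from leave-one-subject-out cross-validation, summarized as a range across folds, but does not show how the metric is computed. The sketch below illustrates one plausible reading of that protocol in plain Python. It is not the authors' code: the function names, the toy two-dimensional "embeddings", the subject labels, and the nearest-centroid classifier (a stand-in for the logistic regression on VAE features described in the abstract) are all assumptions made for illustration. Balanced accuracy is taken here in its usual sense, the mean of per-class recall.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def nearest_centroid_fit_predict(X_train, y_train, X_test):
    """Hypothetical stand-in classifier, NOT the abstract's logistic regression."""
    sums = {}
    counts = defaultdict(int)
    for x, label in zip(X_train, y_train):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(x):
        # Assign the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return [predict(x) for x in X_test]


def leave_one_subject_out(X, y, subjects, fit_predict):
    """Hold out each subject once; return {subject: balanced accuracy}."""
    scores = {}
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        y_pred = fit_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in test]
        )
        scores[held_out] = balanced_accuracy([y[i] for i in test], y_pred)
    return scores


# Toy data (invented): 2-D embeddings of 8 cells from 3 subjects.
X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.2],
     [0.1, 0.3], [5.2, 4.8], [0.0, 0.2], [2.4, 2.4]]
y = ["T cell", "T cell", "B cell", "B cell",
     "T cell", "B cell", "T cell", "B cell"]
subjects = ["s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3"]

scores = leave_one_subject_out(X, y, subjects, nearest_centroid_fit_predict)
mean_pct = 100 * sum(scores.values()) / len(scores)
low_pct, high_pct = 100 * min(scores.values()), 100 * max(scores.values())
# Same "mean (min-max)" layout as the table entries; whether the table's
# point estimate is a mean across folds is an assumption.
print(f"{mean_pct:.0f} ({low_pct:.0f}-{high_pct:.0f})")
```

Because each fold holds out every cell from one subject, the held-out score reflects generalization to an unseen donor rather than to unseen cells from a donor already in the training data. Averaging recall per class keeps rare cell types from being swamped by abundant ones, which plain accuracy would allow.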
oncology