Abstract: Single-cell RNA sequencing (scRNA-seq) has been widely used in cancer research to understand the diversity of gene expression and cancer heterogeneity. However, manual annotation of cell types in the scRNA-seq pipeline is time-consuming and depends on the expertise of the analyst, which can significantly influence the results of downstream analyses. To address this problem, we propose a novel machine learning framework based on the LightGBM model for automated and efficient cell-type annotation of scRNA-seq data. Two independent scRNA-seq datasets of non-small cell lung cancer (NSCLC), downloaded from the Gene Expression Omnibus (GEO), were used to train and test our model. A standard procedure was applied to both datasets for quality control and preprocessing, in which poor-quality cells with low gene expression or high scores for cellular stress/death were excluded. In addition, Harmony was applied to mitigate batch effects, which can introduce variability due to non-biological experimental factors. Nine cell types (endothelial, epithelial, fibroblast, macrophage, mast, plasma, pulmonary alveolar, B, and T cells) were manually labeled in the two datasets by the data providers, and these labels were also checked against cell-type marker genes from PanglaoDB and DAVID. The manually labeled cell types were used as the ground truth for training and testing our model. In the training stage, the training dataset (85,000 cells from 44 NSCLC samples) was used to train the LightGBM model on its highly variable genes. The model was then evaluated on an independent test dataset (8,000 cells from 18 NSCLC samples) by comparing the automatically predicted cell types with the manually labeled ones.
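
The comparison of predicted and manually labeled cell types can be illustrated with a small, dependency-free sketch. It assumes macro-averaging over cell types, which the abstract does not specify; the function name `macro_metrics` and the toy labels are hypothetical and are not data or code from the study.

```python
from collections import Counter


def macro_metrics(true_labels, predicted_labels):
    """Return (accuracy, macro precision, macro F1) for two equal-length label lists.

    Assumption: per-class scores are averaged with equal weight per cell type
    (macro-averaging). The abstract does not state which averaging was used.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    true_pos = Counter()    # cells predicted as type c that are labeled type c
    pred_count = Counter()  # cells predicted as type c
    true_count = Counter()  # cells manually labeled as type c
    for truth, pred in zip(true_labels, predicted_labels):
        true_count[truth] += 1
        pred_count[pred] += 1
        if truth == pred:
            true_pos[truth] += 1

    # Score every class that appears in either the manual or predicted labels.
    classes = sorted(set(true_count) | set(pred_count))
    precisions, f1_scores = [], []
    for c in classes:
        precision = true_pos[c] / pred_count[c] if pred_count[c] else 0.0
        recall = true_pos[c] / true_count[c] if true_count[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        f1_scores.append(f1)

    accuracy = sum(true_pos.values()) / len(true_labels)
    return accuracy, sum(precisions) / len(classes), sum(f1_scores) / len(classes)


# Toy example with made-up labels (hypothetical, not data from the study):
# one epithelial cell is mislabeled as a fibroblast.
manual = ["T", "T", "B", "epithelial", "epithelial", "mast"]
predicted = ["T", "T", "B", "epithelial", "fibroblast", "mast"]
accuracy, precision, f1 = macro_metrics(manual, predicted)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} f1={f1:.2f}")
```

In this toy case a single misassigned epithelial cell lowers the macro F1 more than the accuracy, because the wrongly predicted class contributes a per-class score of zero. With a different averaging scheme (for example, weighting each cell type by its cell count) the numbers would differ.
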
The training results showed that our model could successfully distinguish the nine cell types, achieving an overall average accuracy, F1 score, and precision of 0.86 each. On the independent test dataset, the model generalized well, showing high predictive performance across cell types, with an average accuracy, F1 score, and precision of 0.8, 0.78, and 0.8, respectively. Within the test dataset predictions, some epithelial cells were mistakenly identified as other cell types, possibly because tumor epithelial cells exhibit complex gene expression patterns that make accurate prediction challenging. The proposed machine learning framework facilitates cell labeling and helps unravel the intricate heterogeneity within lung cancer datasets. The combination of LightGBM and standardized preprocessing establishes a benchmark for high-throughput, accurate single-cell analysis, paving the way for more targeted discoveries with significant clinical impact.

Citation Format: Tsung Hsien Chuang, Liang-Chuan Lai, Tzu-Pin Lu, Mong-Hsun Tsai, Hsiang-Han Chen, Eric Y. Chuang. Enhancing single-cell RNA sequencing analysis in cancer research: A machine learning framework based on LightGBM for automated cell type annotation [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 878.
