Abstract: Single-cell RNA sequencing (scRNA-seq) is widely used in cancer research to characterize gene expression diversity and tumor heterogeneity. However, manual annotation of cell types in the scRNA-seq pipeline is time-consuming and depends on the expertise of the analyst, which can significantly influence the results of downstream analyses. To address this problem, we propose a machine learning framework that uses the LightGBM model for automated and efficient cell-type annotation of scRNA-seq data. Two independent scRNA-seq datasets of non-small cell lung cancer (NSCLC), downloaded from the Gene Expression Omnibus (GEO), were used to train and test the model. A standard procedure was applied to both datasets for quality control and preprocessing, in which poor-quality cells with low gene expression or high scores for cellular stress/death were excluded. In addition, Harmony was applied to mitigate batch effects, which can introduce variability from non-biological experimental factors. Nine cell types (endothelial, epithelial, fibroblast, macrophage, mast, plasma, pulmonary alveolar, B, and T cells) had been manually labeled in the two datasets by the data providers, and these labels were also checked against cell-type marker genes from PanglaoDB and DAVID. The manually labeled cell types served as the ground truth for training and testing the model. In the training stage, the LightGBM model was trained on the highly variable genes of the training dataset (85,000 cells from 44 NSCLC samples). The model was then evaluated on an independent test dataset (8,000 cells from 18 NSCLC samples) by comparing the automatically predicted cell types with the manually labeled ones.
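The comparison of predicted and manually labeled cell types can be summarized with accuracy, precision, and F1 score. The sketch below is only an illustration in plain Python, not the authors' evaluation code: the abstract does not say how the per-cell-type scores were averaged, so macro-averaging (an unweighted mean over cell types) is an assumption here, and the function name `macro_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def macro_metrics(true_labels, predicted_labels):
    """Return (accuracy, macro precision, macro F1) for two equal-length label lists.

    Assumption: scores are macro-averaged over the cell types present in the
    ground-truth labels; the abstract does not state the averaging scheme.
    """
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    true_count = Counter(true_labels)            # cells per labeled cell type
    predicted_count = Counter(predicted_labels)  # cells per predicted cell type
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)

    accuracy = sum(correct.values()) / len(true_labels)

    precisions, f1_scores = [], []
    for cell_type in sorted(true_count):
        n_pred = predicted_count[cell_type]
        precision = correct[cell_type] / n_pred if n_pred else 0.0
        recall = correct[cell_type] / true_count[cell_type]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        f1_scores.append(f1)

    n_types = len(true_count)
    return accuracy, sum(precisions) / n_types, sum(f1_scores) / n_types


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    manual = ["T", "T", "B", "epithelial", "epithelial", "mast"]
    predicted = ["T", "T", "B", "fibroblast", "epithelial", "mast"]
    acc, prec, f1 = macro_metrics(manual, predicted)
    print(f"accuracy={acc:.2f} precision={prec:.2f} F1={f1:.2f}")
```

Macro-averaging gives rare cell types the same weight as abundant ones; a support-weighted average would instead track the overall accuracy more closely.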
On the training data, the model distinguished the nine cell types with an average accuracy, F1 score, and precision of 0.86 each. On the independent test dataset, the model generalized well, showing high predictive performance across all cell types, with an average accuracy, F1 score, and precision of 0.80, 0.78, and 0.80, respectively. In the test dataset, some epithelial cells were misclassified as other cell types. This may reflect the complex gene expression patterns of tumor epithelial cells, which make accurate prediction challenging. The proposed machine learning framework facilitates cell labeling and helps unravel the heterogeneity within lung cancer datasets. The combination of LightGBM and standardized preprocessing establishes a benchmark for high-throughput, accurate single-cell analysis, paving the way for more targeted discoveries with significant clinical impact.

Citation Format: Tsung Hsien Chuang, Liang-Chuan Lai, Tzu-Pin Lu, Mong-Hsun Tsai, Hsiang-Han Chen, Eric Y. Chuang. Enhancing single-cell RNA sequencing analysis in cancer research: A machine learning framework based on LightGBM for automated cell type annotation [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 878.
