Objectively Evaluating the Reliability of Cell Type Annotation Using LLM-Based Strategies

Wenjin Ye, Yuanchen Ma, Junkai Xiang, Hongjie Liang, Tao Wang, Qiuling Xiang, Andy Peng Xiang, Wu Song, Weiqiang Li, Weijun Huang
2024-09-24
Abstract: Reliability in cell type annotation is challenging in single-cell RNA-sequencing data analysis because both expert-driven and automated methods can be biased or constrained by their training data, especially for novel or rare cell types. Although large language models (LLMs) are useful, our evaluation found that only a few matched expert annotations due to biased data sources and inflexible training inputs. To overcome these limitations, we developed the LICT (Large language model-based Identifier for Cell Types) software package using a multi-model fusion and "talk-to-machine" strategy. Tested across various single-cell RNA sequencing datasets, our approach significantly improved annotation reliability, especially in datasets with low cellular heterogeneity. Notably, we established objective criteria to assess annotation reliability using the "talk-to-machine" approach, which addresses discrepancies between our annotations and expert ones, enabling reliable evaluation even without reference data. This strategy enhances annotation credibility and sets the stage for advancing future LLM-based cell type annotation methods.
Quantitative Methods, Genomics
What problem does this paper attempt to address?
The paper addresses the reliability of cell type annotation in single-cell RNA sequencing (scRNA-seq) data. Both expert-driven and automated methods can be biased or constrained by their training data, especially for novel or rare cell types. Large language models (LLMs) are useful for annotation, but the authors' evaluation found that only a few matched expert annotations, which they attribute to biased data sources and inflexible training inputs. To overcome these limitations, the team developed LICT (Large language model-based Identifier for Cell Types), a software package that combines multi-model fusion with a "talk-to-machine" strategy. Tested across various scRNA-seq datasets, LICT significantly improved annotation reliability, particularly in datasets with low cellular heterogeneity. The paper also establishes objective criteria for assessing annotation reliability through the "talk-to-machine" approach, which addresses discrepancies between LICT's annotations and expert ones and allows reliable evaluation even without reference data. According to the authors, this enhances annotation credibility and lays the groundwork for future LLM-based cell type annotation methods.
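The abstract does not say how the multi-model fusion combines the outputs of individual LLMs. As a purely illustrative sketch, and assuming a simple per-cluster majority vote (which may differ from what LICT actually does), fusing labels from several models could look like the following. The function name `fuse_annotations`, the model names, and the input layout are all hypothetical.

```python
from collections import Counter


def fuse_annotations(per_model_labels):
    """Fuse cell type labels by per-cluster majority vote.

    This is an assumed fusion rule for illustration, not LICT's documented method.

    per_model_labels: dict of model name -> dict of cluster id -> label.
    Returns: dict of cluster id -> (winning label, fraction of models agreeing).
    """
    # Collect every cluster id that at least one model annotated.
    clusters = sorted({c for labels in per_model_labels.values() for c in labels})
    fused = {}
    for cluster in clusters:
        # Gather the votes of the models that labeled this cluster.
        votes = [
            labels[cluster]
            for labels in per_model_labels.values()
            if cluster in labels
        ]
        # On a tie, Counter.most_common keeps the label that was seen first.
        label, count = Counter(votes).most_common(1)[0]
        fused[cluster] = (label, count / len(votes))
    return fused


# Hypothetical predictions from three models for two clusters.
preds = {
    "model_a": {0: "T cell", 1: "B cell"},
    "model_b": {0: "T cell", 1: "NK cell"},
    "model_c": {0: "T cell", 1: "B cell"},
}
fused = fuse_annotations(preds)
```

The agreement fraction returned next to each label is only a rough proxy for confidence. It is not the objective reliability criterion the paper describes, which is based on the "talk-to-machine" approach.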