GeneLLM: A Large cfRNA Language Model for Cancer Screening from Raw Reads

Siwei Deng, Lei Sha, Yongcheng Jin, Tianyao Zhou, Chengen Wang, Qianpu Liu, Hongjie Guo, Chengjie Xiong, Yangtao Xue, Xiaoguang Li, Yuanming Li, Yaping Gao, Mengyu Hong, Junjie Xu, Shanwen Chen, Pengyuan Wang
DOI: https://doi.org/10.1101/2024.06.29.601341
2024-07-02
Abstract: Plasma cell-free RNA (cfRNA) has recently emerged as a promising biomarker for non-invasive early cancer detection and treatment monitoring. Here, we introduce GeneLLM, a novel large language model designed to interpret cfRNA sequences directly, bypassing the need for genome annotations. GeneLLM significantly advances the detection accuracy of various cancer types. Our study demonstrates that this method achieves higher accuracy than traditional biomarkers and effectively handles large datasets from different centres, even with low sequencing depth. By avoiding the use of bioinformatics tools to count known genes, GeneLLM also discovered cfRNAs from previously unknown genes, referred to as ‘dark matters’ in the genome, as cancer detection ‘pseudo-biomarkers’. Our results showcase the potential of GeneLLM to revolutionise cancer detection, making it more accessible and cost-effective. By offering a method that does not depend on bioinformatics tools to count known genes, GeneLLM opens new avenues for biomarker discovery and enhances our understanding of intercellular communication through novel RNA molecules.
Bioinformatics