Transformer‐based representation learning and multiple‐instance learning for cancer diagnosis exclusively from raw sequencing fragments of bisulfite‐treated plasma cell‐free DNA

Jilei Liu, Hongru Shen, Yichen Yang, Meng Yang, Qiang Zhang, Kexin Chen, Xiangchun Li
DOI: https://doi.org/10.1002/1878-0261.13745
2024-10-10
Molecular Oncology
Abstract: This study introduces DECIDIA, a deep‐learning approach for early cancer diagnosis using bisulfite‐treated cfDNA sequencing fragments. Using transformer‐based representation learning and weakly supervised multiple‐instance learning, DECIDIA accurately detects cancer and predicts cancer types, significantly simplifying data analysis. By offering an end‐to‐end solution for liquid biopsy‐based diagnostics, DECIDIA advances the potential for early, non‐invasive cancer interception.
Early cancer diagnosis from bisulfite‐treated cell‐free DNA (cfDNA) fragments requires tedious data‐analysis procedures. Here, we present a deep‐learning‐based approach for early cancer interception and diagnosis (DECIDIA) that can achieve accurate cancer diagnosis exclusively from bisulfite‐treated cfDNA sequencing fragments. DECIDIA relies on transformer‐based representation learning of DNA fragments and weakly supervised multiple‐instance learning for classification. We systematically evaluate the performance of DECIDIA for cancer diagnosis and cancer type prediction on a curated dataset of 5389 samples comprising colorectal cancer (CRC; n = 1574), hepatocellular carcinoma (HCC; n = 1181), lung cancer (n = 654), and non‐cancer controls (n = 1980). In 10‐fold cross‐validation on the CRC dataset, DECIDIA achieved an area under the receiver operating characteristic curve (AUROC) of 0.980 (95% CI, 0.976–0.984) in differentiating cancer patients from cancer‐free controls, outperforming benchmarked methods based on methylation intensities. Notably, DECIDIA achieved an AUROC of 0.910 (95% CI, 0.896–0.924) on an external, independent HCC test set in distinguishing HCC patients from cancer‐free controls, although no HCC data were used in model development. For cancer‐type classification, DECIDIA achieved a micro‐average AUROC of 0.963 (95% CI, 0.960–0.966) and an overall accuracy of 82.8% (95% CI, 81.8–83.9%). In addition, we distilled four sequence signatures from the raw sequencing reads that exhibited differential patterns in cancer versus control and among different cancer types. Our approach represents a new paradigm towards eliminating the tedious data‐analysis procedures for liquid biopsy that uses the bisulfite‐treated cfDNA methylome.
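To illustrate the multiple‐instance learning idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each cfDNA fragment has already been mapped to a fixed‐length embedding (for example by a transformer encoder) and that a sample is treated as a "bag" of such embeddings carrying only a sample‐level label. The attention‐style pooling, the function names (`attention_mil_pool`, `predict_sample`), and all weights are hypothetical choices made for illustration; the abstract does not specify how DECIDIA aggregates fragments.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mil_pool(fragment_embeddings, attn_vector):
    """Pool a bag of per-fragment embeddings into one sample-level vector.

    Assumption: attention pooling with a single scoring vector; the paper's
    actual aggregation scheme may differ.
    """
    # One scalar relevance score per fragment (dot product with attn_vector).
    scores = [sum(w * x for w, x in zip(attn_vector, emb))
              for emb in fragment_embeddings]
    weights = softmax(scores)
    dim = len(fragment_embeddings[0])
    # Weighted average of the fragment embeddings.
    pooled = [sum(wt * emb[d] for wt, emb in zip(weights, fragment_embeddings))
              for d in range(dim)]
    return pooled, weights


def predict_sample(fragment_embeddings, attn_vector, clf_weights, clf_bias):
    """Return a sample-level cancer probability and per-fragment weights."""
    pooled, weights = attention_mil_pool(fragment_embeddings, attn_vector)
    logit = sum(w * x for w, x in zip(clf_weights, pooled)) + clf_bias
    return 1.0 / (1.0 + math.exp(-logit)), weights


# Toy bag: 6 fragments with 4-dimensional embeddings (random placeholders,
# standing in for the output of a fragment encoder).
rng = random.Random(0)
bag = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
attn_vector = [0.5, -0.2, 0.1, 0.3]   # hypothetical, untrained
clf_weights = [1.0, 0.0, -1.0, 0.5]   # hypothetical, untrained
prob, weights = predict_sample(bag, attn_vector, clf_weights, 0.0)
```

In a trained model of this kind, only the sample‐level label (cancer versus control) supervises the whole bag, which is what makes the setup weakly supervised; the per‐fragment weights indicate which fragments contributed most to the prediction.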
oncology