Transformer-based deep learning model for the diagnosis of suspected lung cancer in primary care based on electronic health record data

Lan Wang, Yonghua Yin, Ben Glampson, Robert L Peach, Mauricio Barahona, Brendan C Delaney, Erik Mayer
DOI: https://doi.org/10.1101/2024.07.02.24309824
2024-07-05
Abstract: Background: Because it is typically diagnosed at a late stage, lung cancer is the commonest cause of death from cancer in the UK. Existing epidemiological risk models in clinical use, which have Positive Predictive Values (PPV) of less than 10%, do not consider the temporal relations expressed in sequential electronic health record (EHR) data. Machine learning with deep 'transformer' models can learn from these temporal relationships. We aimed to build such a model for lung cancer diagnosis in primary care using EHR data. Methods: In a nested case-control study within the Whole Systems Integrated Care (WSIC) dataset, we identified lung cancer cases and controls with 'other' cancers or respiratory conditions. GP EHR data covering the three years before the date of diagnosis, excluding the most recent month, were semantically pre-processed by mapping more than 30,000 terms to 450. Model building was performed using ALBERT with a Logistic Regression Classifier (LRC) head. Clustering was explored using k-means. We split the data into 70% training and 30% validation. A regression model alone was also built on the pre-processed data as a comparator. Findings: Based on 3,303,992 patients from January 1981 to December 2020, there were 11,847 lung cancer cases, of whom 9,629 had died. 5,789 cases and 7,240 controls were used for training and a population of 368,906 for validation. Our model achieved an AUROC of 0.924 (95% CI 0.921-0.927) with a PPV of 3.6% (95% CI 3.5-3.7) and a sensitivity of 86.6% (95% CI 85.3-87.8), based on the three years of data prior to diagnosis, excluding the month immediately before the index diagnosis. The comparator regression model achieved a PPV of 3.1% (95% CI 3.0-3.1) and an AUROC of 0.887 (95% CI 0.884-0.889). Interpretation: Capturing temporal sequencing between cancer and non-cancer pathways to diagnosis enables much more accurate models. Future work will focus on external dataset validation and integration into GP clinical systems for evaluation.
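The abstract reports sensitivity and PPV side by side. As a reading aid, the following minimal Python sketch shows how these metrics are conventionally derived from confusion-matrix counts, and how PPV depends on prevalence in the evaluated population. It is not the authors' evaluation code. The class and function names are hypothetical, and the counts are invented purely for illustration; they are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing binary predictions with true labels."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of true cases that the model flags."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of non-cases that the model correctly leaves unflagged."""
    return c.tn / (c.tn + c.fp)


def ppv(c: ConfusionCounts) -> float:
    """Fraction of flagged patients who are true cases."""
    return c.tp / (c.tp + c.fp)


def ppv_from_rates(sens: float, spec: float, prevalence: float) -> float:
    """PPV via Bayes' rule from sensitivity, specificity and prevalence."""
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical counts, NOT study data: 100 cases among 100,000 patients.
counts = ConfusionCounts(tp=90, fp=1980, fn=10, tn=97920)
total = counts.tp + counts.fp + counts.fn + counts.tn
prevalence = (counts.tp + counts.fn) / total

print(f"sensitivity = {sensitivity(counts):.3f}")
print(f"specificity = {specificity(counts):.3f}")
print(f"PPV (counts) = {ppv(counts):.4f}")
print(f"PPV (Bayes)  = {ppv_from_rates(sensitivity(counts), specificity(counts), prevalence):.4f}")
```

Both PPV calculations agree by construction. The sketch illustrates the general point that, when a condition is rare in the evaluated population, a model can have high sensitivity and specificity and still yield a PPV of only a few percent.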