Early detection of non-small cell lung cancer using electronic health record data

Xiudi Li, Erin Yuan, Stephen J Kuperberg, Clara-Lea Bonzel, Mary I Jeffway, Tianrun Cai, Katherine P Liao, Raquel Aguiar-Ibanez, Yu-Han Kao, Melissa L. Santorelli, David C Christiani, Tianxi Cai, Rui Duan
DOI: https://doi.org/10.1101/2024.10.28.24316275
2024-10-29
Abstract: Rationale: Specific patient characteristics increase the risk of cancer, necessitating personalized healthcare approaches. For high-risk individuals, tailored clinical management ensures proactive monitoring and timely interventions. Electronic Health Records (EHR) data are crucial for supporting these personalized approaches, improving cancer prevention and early diagnosis. Objectives: We leverage EHR data to build a prediction model for early detection of non-small cell lung cancer (NSCLC). Methods: We utilize data from Mass General Brigham's EHR and implement a three-stage ensemble learning approach. Initially, we generate risk scores using multivariate logistic regression in a self-control and a case-control design to distinguish between cases and controls. Subsequently, these risk scores are integrated and calibrated using a prospective Cox model to develop the risk prediction model. Results: We identified 127 EHR-derived features predictive of early detection of NSCLC. The highly predictive features include smoking, relevant lab test results, and chronic lung diseases. The predictive model reached an area under the ROC curve (AUC) of 0.801 (positive predictive value (PPV) 0.0173 with specificity 0.02) for predicting one-year NSCLC risk in a population aged 18 and above, and an AUC of 0.757 (PPV 0.0196 with specificity 0.02) in a population aged 40 and above. Conclusions: This study identified EHR-derived features that are predictive of early NSCLC diagnosis. The developed risk prediction model exhibits superior performance for early detection of NSCLC compared to a baseline model that relies only on demographic and smoking information, demonstrating the potential of incorporating EHR-derived features for personalized cancer screening recommendations and early detection.
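For readers unfamiliar with the metrics quoted in the Results, the following is a minimal sketch of how AUC, PPV, and specificity can be computed from a set of risk scores and observed outcomes. It is not the authors' code: the function names, the toy scores, and the threshold of 0.7 are all hypothetical and chosen only for illustration, and the abstract does not state how the operating threshold was selected.

```python
# Illustrative only: toy data and helper names are assumptions, not from the paper.

def auc(scores, labels):
    """AUC as the probability that a random case scores higher than a
    random control, counting ties as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def ppv(scores, labels, threshold):
    """Fraction of individuals flagged (score >= threshold) who are true cases."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(flagged) / len(flagged) if flagged else float("nan")


def specificity(scores, labels, threshold):
    """Fraction of controls that are not flagged (score < threshold)."""
    controls = [s for s, y in zip(scores, labels) if y == 0]
    return sum(1 for s in controls if s < threshold) / len(controls)


# Hypothetical one-year risk scores and outcomes (1 = NSCLC diagnosis).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0, 0, 0]

print("AUC:", auc(scores, labels))
print("PPV at 0.7:", ppv(scores, labels, 0.7))
print("Specificity at 0.7:", specificity(scores, labels, 0.7))
```

Because NSCLC is rare in a general population, PPV can remain small even when AUC is high, which is why the abstract reports both.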