Predicting Early-Onset Colorectal Cancer in Individuals Below Screening Age Using Machine Learning and Real-World Data
Chengkun Sun, Erin Mobley, Michael Quillen, Max Parker, Meghan Daly, Rui Wang, Isabela Visintin, Ziad Ziad, Jennifer Fishe, Alexander Parker, Thomas George, Jiang Bian, Jie Xu
DOI: https://doi.org/10.1101/2024.07.17.24310573
2024-07-17
Abstract: Background: Colorectal cancer (CRC) is now the leading cause of cancer-related deaths among young Americans. This study aims to predict early-onset CRC (EOCRC) in individuals younger than 45, the recommended screening age, using machine learning (ML) and structured electronic health record (EHR) data.
Methods: We identified a cohort of patients under 45 from the OneFlorida+ Clinical Research Consortium. Given the distinct pathology of colon cancer (CC) and rectal cancer (RC), we built separate prediction models for each cancer type using several ML algorithms. We assessed multiple prediction time windows (0, 1, 3, and 5 years) and applied propensity score matching (PSM) to account for confounding variables. Model performance was assessed using established metrics. Additionally, we used SHapley Additive exPlanations (SHAP) to identify risk factors for EOCRC.
Results: The models achieved area under the curve (AUC) scores of 0.811, 0.748, 0.689, and 0.686 for CC prediction, and 0.829, 0.771, 0.727, and 0.721 for RC prediction at 0, 1, 3, and 5 years, respectively. Notable predictors included immune and digestive system disorders, along with secondary cancers and underweight, all of which were prevalent in both the CC and RC groups. Blood diseases emerged as prominent indicators of CC.
Conclusion: This study highlights the potential of ML techniques applied to EHR data to predict EOCRC, which could support earlier diagnosis in patients who are below the recommended screening age.
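As an illustration of the AUC metric reported in the Results, the following is a minimal sketch of how an AUC can be computed from predicted risk scores. It is not the authors' code: the function name `roc_auc` and the toy labels and scores are hypothetical, and the abstract does not state which software was used. The sketch relies on the standard interpretation of AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for a case, 0 for a control (hypothetical encoding).
    scores: predicted risk, higher meaning more likely to be a case.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")

    correctly_ranked = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                correctly_ranked += 1.0   # case outranks control
            elif c == k:
                correctly_ranked += 0.5   # tie counts as half
    return correctly_ranked / (len(case_scores) * len(control_scores))


# Made-up toy data for illustration only; not from the study.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in cohort size and is meant only to make the definition concrete; on a real cohort, a rank-based implementation from an established library would be the more practical choice.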
Health Informatics