Predicting Early-Onset Colorectal Cancer in Individuals Below Screening Age Using Machine Learning and Real-World Data

Chengkun Sun, Erin Mobley, Michael Quillen, Max Parker, Meghan Daly, Rui Wang, Isabela Visintin, Ziad Ziad, Jennifer Fishe, Alexander Parker, Thomas George, Jiang Bian, Jie Xu
DOI: https://doi.org/10.1101/2024.07.17.24310573
2024-07-17
Abstract: Background: Colorectal cancer (CRC) is now the leading cause of cancer-related deaths among young Americans. Our study aims to predict early-onset CRC (EOCRC) using machine learning (ML) and structured electronic health record (EHR) data for individuals under the screening age of 45. Methods: We identified a cohort of patients under 45 from the OneFlorida+ Clinical Research Consortium. Given the distinct pathology of colon cancer (CC) and rectal cancer (RC), we built separate prediction models for each cancer type using various ML algorithms. We assessed multiple prediction time windows (0, 1, 3, and 5 years) and used propensity score matching (PSM) to account for confounding variables. Model performance was assessed using established metrics. Additionally, we employed SHapley Additive exPlanations (SHAP) to identify risk factors for EOCRC. Results: The models achieved Area Under the Curve (AUC) scores of 0.811, 0.748, 0.689, and 0.686 for CC prediction, and 0.829, 0.771, 0.727, and 0.721 for RC prediction at 0, 1, 3, and 5 years, respectively. Notable predictors included immune and digestive system disorders, along with secondary cancers and underweight, prevalent in both CC and RC groups. Blood diseases emerged as prominent indicators of CC. Conclusion: This study highlights the potential of ML techniques in leveraging EHR data to predict EOCRC, offering valuable insights for potential early diagnosis in patients who are below the recommended screening age.
Health Informatics
What problem does this paper attempt to address?
This paper addresses the problem of predicting early-onset colorectal cancer (EOCRC) in individuals under 45 years old. Specifically, the research objectives are as follows:

1. **Utilize machine learning and electronic health record (EHR) data**: Apply multiple machine-learning algorithms to structured EHR data to predict EOCRC in individuals under 45 years old.
2. **Distinguish between colon cancer and rectal cancer**: Because colon cancer (CC) and rectal cancer (RC) differ in pathology, molecular mechanisms, clinical manifestations, surgical methods, and treatment strategies, the study built a separate prediction model for each cancer type.
3. **Evaluate the prediction performance in different time windows**: The study evaluated multiple prediction time windows (0, 1, 3, and 5 years) to assess the model's predictive ability at different horizons.
4. **Reduce the influence of data bias and confounding factors**: Use propensity score matching (PSM) to construct a comparable control group, reducing the influence of potential data bias and confounding variables.
5. **Improve the interpretability of the model**: Use the SHapley Additive exPlanations (SHAP) method to identify and explain the key risk factors driving EOCRC prediction, making the model more transparent and its predictions easier to trust.

Through these methods, the study aims to provide valuable early-diagnosis information for individuals under 45 years old, thereby improving disease management and prognosis for this group.
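
To make the propensity score matching step more concrete, the following is a minimal sketch of one common variant: greedy 1:1 nearest-neighbour matching with a caliper. The summary above does not say which matching algorithm, ratio, or caliper the authors used, so this is an illustrative assumption rather than their method. It also assumes the propensity scores have already been estimated (for example, with a logistic regression on baseline covariates). The function name `greedy_match`, the patient identifiers, and the scores are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def greedy_match(
    case_scores: Dict[str, float],
    control_scores: Dict[str, float],
    caliper: Optional[float] = 0.05,
) -> List[Tuple[str, str]]:
    """Pair each case with its closest unused control by propensity score.

    Hypothetical helper: greedy 1:1 matching without replacement.
    A case stays unmatched if no control lies within the caliper.
    """
    available = dict(control_scores)  # controls not yet used in a pair
    pairs: List[Tuple[str, str]] = []
    # Process cases in a fixed order so the result is reproducible.
    for case_id in sorted(case_scores):
        if not available:
            break
        score = case_scores[case_id]
        # Closest remaining control; ties are broken by identifier.
        best_id = min(
            available, key=lambda cid: (abs(available[cid] - score), cid)
        )
        if caliper is not None and abs(available[best_id] - score) > caliper:
            continue  # no control is close enough; leave the case unmatched
        pairs.append((case_id, best_id))
        del available[best_id]  # matching without replacement
    return pairs


# Illustrative, made-up propensity scores (not data from the study).
cases = {"case_a": 0.30, "case_b": 0.62, "case_c": 0.95}
controls = {"ctrl_1": 0.10, "ctrl_2": 0.31, "ctrl_3": 0.60, "ctrl_4": 0.33}

matched = greedy_match(cases, controls)
print(matched)
```

In this toy example `case_c` is left unmatched because no control falls within the caliper. Dropping such cases trades sample size for closer covariate balance, which is the usual reason for setting a caliper.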