Survival Prediction in Non-Small Cell Lung Cancer Patients: Performance Evaluation of Machine Learning Ensemble Models (Preprint)
Jianguo Sun, Lingchen Li, Anni Zhang, Chenrui Yin, Dingqin Cai, Yi Wang, Lijiao Xie, Lvjun Yan, Kai Niu, Jianbo Luo, Xiaoyan Shi, Chunli Jian, Linpeng Zheng, Fang Bin, Zhou Yi, Liu Zepeng
DOI: https://doi.org/10.2196/preprints.66993
2024-01-01
Abstract: Non-small cell lung cancer (NSCLC) is a leading cause of cancer-related mortality, underscoring the need for precise prognostic tools to support clinical decision-making. Existing models often fall short of providing reliable survival predictions. This study aims to develop a robust machine learning ensemble model to refine survival probability estimates for NSCLC patients and support clinical decisions. After applying inclusion and exclusion criteria, we identified 93,828 NSCLC patients diagnosed between 2004 and 2019 in the Surveillance, Epidemiology, and End Results (SEER) database. The dataset was partitioned into training (70%) and testing (30%) subsets. Spearman correlation and univariate Cox regression analyses were conducted to identify variables significantly associated with overall survival. A soft voting ensemble model was constructed, integrating multilayer perceptron (MLP), XGBoost, and Random Forest algorithms. Model performance was assessed using area under the curve (AUC) and decision curve analysis (DCA) in both the training and internal validation datasets. An external validation cohort of 325 NSCLC patients from our hospital (2015-2023) was used to evaluate generalizability. Eleven variables were identified: sex, age, histologic type, tumor grade, tumor size, N stage, M stage, and metastases to four sites (bone, lung, brain, and liver). The soft voting ensemble model yielded AUCs of 0.819, 0.818, and 0.819 for predicting 1-year, 3-year, and 5-year survival, respectively, in the internal test set, indicating good discrimination. External validation produced AUCs of 0.809, 0.793, and 0.800. DCA supported the model’s clinical utility. SHapley Additive exPlanations (SHAP) analysis highlighted tumor size, N stage, and M stage as the most influential factors. The soft voting ensemble model offers an accurate prognostic tool for NSCLC survival prediction that may improve prognostic assessment in clinical practice.
SHAP feature importance analysis may further support the development of personalized treatment strategies.
This study was supported by the National Natural Science Foundation of China (82172670, 81972858, 82202951, and 81773245), the Technology Innovation and Application Development Project of Chongqing (2023DBXM002, CSTB2022TIAD-KPX0176, CSTB2022NSCQ-MSX1356, and 2022yjgA06), and the Cultivation Program for Clinical Research Talents of Army Medical University (2018XLC1010 and 2019XQN10).
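
The soft voting, AUC, and decision curve steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`soft_vote`, `roc_auc`, `net_benefit`), the equal model weights, the outcome coding (1 = event within the prediction horizon), and the toy probabilities standing in for the MLP, XGBoost, and Random Forest outputs are all assumptions made for illustration.

```python
def soft_vote(prob_lists, weights=None):
    """Soft voting: weighted average of each model's predicted event probability."""
    if weights is None:
        weights = [1.0] * len(prob_lists)  # assumption: equal weights
    total = sum(weights)
    n_patients = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_patients)
    ]


def roc_auc(y_true, scores):
    """AUC as the fraction of (event, non-event) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def net_benefit(y_true, scores, threshold):
    """Decision curve net benefit: TP/n - FP/n * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Hypothetical toy data: six patients, 1 = event within the horizon.
y_true = [1, 1, 1, 0, 0, 0]
# Hypothetical per-model probabilities (placeholders for fitted models).
mlp_probs = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1]
xgb_probs = [0.8, 0.7, 0.3, 0.6, 0.3, 0.2]
rf_probs = [0.7, 0.8, 0.5, 0.4, 0.1, 0.3]

ensemble = soft_vote([mlp_probs, xgb_probs, rf_probs])
auc = roc_auc(y_true, ensemble)

# Compare the model against the "treat all" strategy at one threshold.
pt = 0.45
nb_model = net_benefit(y_true, ensemble, pt)
nb_treat_all = net_benefit(y_true, [1.0] * len(y_true), pt)

print("ensemble probabilities:", [round(p, 3) for p in ensemble])
print("AUC:", round(auc, 3))
print("net benefit (model, treat all):", round(nb_model, 3), round(nb_treat_all, 3))
```

In practice the three probability lists would come from fitted classifiers evaluated at a fixed horizon (for example 1, 3, or 5 years), and net benefit would be plotted across a range of thresholds rather than computed at a single one.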