A Comparison Study of Cox Models and Machine Learning Methods for Developing Breast Cancer Prognostic Prediction Models (Preprint)

Jialong Xiao, Miao Mo, Zezhou Wang, Changming Zhou, Jie Shen, Jing Yuan, Yulian He, Ying Zheng
DOI: https://doi.org/10.2196/preprints.33440
2021-01-01
Abstract:
BACKGROUND: In recent years, machine learning (ML) methods have been increasingly explored for cancer prognosis prediction, driven by improved algorithms that can model censored data, such as support vector machines (SVM) for survival analysis and random survival forests (RSF). However, it is still debated whether traditional (Cox proportional hazards regression) or ML-based prognostic prediction models have better predictive performance.
OBJECTIVE: This study aims to use machine learning algorithms to predict the survival of breast cancer patients and to compare their predictive performance with that of traditional Cox regression.
METHODS: This retrospective cohort study included all patients diagnosed with breast cancer and subsequently hospitalized in Fudan University Shanghai Cancer Center (FUSCC) between January 1, 2008 and December 31, 2016. A total of 25,267 cases with 21 features were eligible for model development, and the data set was randomly split into a training set (70%) and a test set (30%) to develop four models and predict overall survival in breast cancer patients. The discriminative ability of the models was evaluated by the concordance index (C-index) and the time-dependent area under the curve (AUC); the calibration ability of the models was evaluated by the Brier score.
RESULTS: The RSF model showed the best discriminative performance among the four models, with 3-year, 5-year, and 10-year time-dependent AUCs of 0.857, 0.838, and 0.781, respectively, and a C-index of 0.827 (0.809, 0.845), which significantly outperformed the Cox-EN model (0.816, p=0.007), the Cox model (0.814, p=0.003), and the SVM model (0.812, p<0.001). The four models' 3-year, 5-year, and 10-year Brier scores were very close, ranging from 0.027 to 0.094, indicating that all models had good calibration. In terms of feature importance, elastic net and RSF both indicated that TNM staging, neoadjuvant therapy, number of lymph node metastases, age, and tumor diameter were the top 5 features for predicting the prognosis of breast cancer. A final online tool was developed to predict the overall survival of breast cancer patients.
CONCLUSIONS: The RSF model slightly outperformed the other models in discriminative ability, suggesting its potential as an effective approach for survival analysis.
CLINICALTRIAL: ClinicalTrials.gov, registration number: NCT04996732.
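
The abstract ranks the models by C-index. As a rough illustration of what that metric measures, here is a minimal sketch of Harrell's concordance index for right-censored data. It is not the authors' implementation: the function name `harrell_c_index` and the toy data are hypothetical, and pairs with tied observed times are simply skipped, which is a simplifying assumption.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times       -- observed follow-up time per patient
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Pairs with tied times are skipped (simplifying assumption).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier death was given the higher risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: five patients, two of them censored.
toy_times = [5, 8, 12, 3, 9]
toy_events = [1, 0, 1, 1, 0]
toy_risk = [0.9, 0.4, 0.2, 0.8, 0.5]
toy_c = harrell_c_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported values of 0.812 to 0.827 mean that, in roughly four out of five comparable patient pairs, each model assigned the higher risk to the patient who died earlier.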