Explainable machine learning model identified potential biomarkers in liver cancer survival prediction

Qi Pan, Alphonse Houssou Hounye, Kexin Miao, Liuyan Su, Jiaoju Wang, Muzhou Hou, Li Xiong
DOI: https://doi.org/10.1016/j.bspc.2024.106504
IF: 5.1
2024-06-01
Biomedical Signal Processing and Control
Abstract:
Objective: Liver cancer is a malignant tumor with a high incidence; common treatments include surgical resection, ablation, arterial catheterization, and liver transplantation. Improving the clinical evaluation and therapy management of liver hepatocellular carcinoma (LIHC) is crucial, and when machine learning methods are incorporated into decision-making, the interpretability of the models must be considered. In this study, the SHapley Additive exPlanation (SHAP) technique was applied to interpret a gradient-boosting decision tree (XGBoost) survival model built on The Cancer Genome Atlas (TCGA) data, with the aim of explaining black-box survival models and identifying potential biomarkers for liver cancer survival prediction.
Methods: Expression data and clinical information for liver cancer samples were obtained from the TCGA database, and immunogenic cell death (ICD)-related genes were retrieved from the literature. Genes were screened using bioinformatics and machine learning methods. The screened differentially expressed genes (DEGs) and ICD-related genes were combined to construct the SurvMLSHAP model, and the SurvMLSHAP score was calculated. Three methods (Bayesian optimization, random search, and a genetic algorithm) were used for parameter optimization. Eight machine learning models were built and compared to evaluate the superiority of the proposed model and to select the best one.
Results: The SurvMLSHAP model output was interpreted using the XGBoost-based SHAP method to assess the influence and significance of each feature. Tests on both synthetic and medical data validate the capability of SurvMLSHAP to identify factors with a time-dependent impact. The C-index was 0.6844 on the raw data and 0.8167 on the validation data. Furthermore, the aggregation of SurvMLSHAP yields a more accurate assessment of variable relevance for prediction than existing approaches. The features contributing to the XGBoost model were, in order, CEP55, PPIA, TTC36, and HSP90AA1, which could be used as predictors to assess the LIHC cohort, while the putative molecular subgroups could provide new ideas for individualized treatment of LIHC.
Conclusion: In this study, a prognostic risk model called SurvMLSHAP was constructed based on bioinformatics and machine learning methods, and ICD-related biomarkers were screened to assess the prognostic outcome of LIHC patients, which may support personalized treatment in the clinic.
engineering, biomedical
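The abstract reports model quality as a C-index (0.6844 and 0.8167). As a rough illustration of what that number measures, the sketch below computes a concordance index over right-censored survival data in plain Python. It assumes Harrell's definition, which the abstract does not confirm, and it simply skips pairs with tied follow-up times. The function name and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell-style C-index (assumed variant, not confirmed by the paper).

    times:       observed follow-up times
    events:      1 if the event (death) was observed, 0 if censored
    risk_scores: model output, higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Tied times are skipped here for simplicity.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): five patients, two of them censored.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.7, 0.5, 0.1]
print(concordance_index(times, events, risk))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of patients by risk, so the reported 0.8167 means roughly 82% of comparable patient pairs were ordered correctly under this reading.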