A Machine Learning Model Based on Genetic and Traditional Cardiovascular Risk Factors to Predict Premature Coronary Artery Disease

Benrong Liu,Lei Fang,Yujuan Xiong,Qiqi Du,Yang Xiang,Xiaohui Chen,Chao-Wei Tian,Shi-Ming Liu
DOI: https://doi.org/10.31083/j.fbl2707211
IF: 3.1
2022-07-04
Frontiers in Bioscience-Landmark
Abstract:BACKGROUND: Premature coronary artery disease (PCAD) has a poor prognosis and a high mortality and disability rate. Accurate prediction of the risk of PCAD is very important for the prevention and early diagnosis of this disease. Machine learning (ML) has been proven a reliable method used for disease diagnosis and for building risk prediction models based on complex factors. The aim of the present study was to develop an accurate prediction model of PCAD risk that allows early intervention.METHODS: We performed retrospective analysis of single nucleotide polymorphisms (SNPs) and traditional cardiovascular risk factors (TCRFs) for 131 PCAD patients and 187 controls. The data was used to construct classifiers for the prediction of PCAD risk with the machine learning (ML) algorithms LogisticRegression (LRC), RandomForestClassifier (RFC) and GradientBoostingClassifier (GBC) in scikit-learn. Three quarters of the participants were randomly grouped into a training dataset and the rest into a test dataset. The performance of classifiers was evaluated using area under the receiver operating characteristic curve (AUC), sensitivity and concordance index. R packages were used to construct nomograms.RESULTS: Three optimized feature combinations (FCs) were identified: RS-DT-FC1 (rs2259816, rs1378577, rs10757274, rs4961, smoking, hyperlipidemia, glucose, triglycerides), RS-DT-FC2 (rs1378577, rs10757274, smoking, diabetes, hyperlipidemia, glucose, triglycerides) and RS-DT-FC3 (rs1169313, rs5082, rs9340799, rs10757274, rs1152002, smoking, hyperlipidemia, high-density lipoprotein cholesterol). These were able to build the classifiers with an AUC >0.90 and sensitivity >0.90. The nomograms built with RS-DT-FC1, RS-DT-FC2 and RS-DT-FC3 had a concordance index of 0.94, 0.94 and 0.90, respectively, when validated with the test dataset, and 0.79, 0.82 and 0.79 when validated with the training dataset. Manual prediction of the test data with the three nomograms resulted in an AUC of 0.89, 0.92 and 0.83, respectively, and a sensitivity of 0.92, 0.96 and 0.86, respectively.CONCLUSIONS: The selection of suitable features determines the performance of ML models. RS-DT-FC2 may be a suitable FC for building a high-performance prediction model of PCAD with good sensitivity and accuracy. The nomograms allow practical scoring and interpretation of each predictor and may be useful for clinicians in determining the risk of PCAD.
cell biology,biochemistry & molecular biology
What problem does this paper attempt to address?