Unlocking stroke prediction: Harnessing projection-based statistical feature extraction with ML algorithms

Saad Sahriar, Sanjida Akther, Jannatul Mauya, Ruhul Amin, Md Shahajada Mia, Sabba Ruhi, Md Shamim Reza
DOI: https://doi.org/10.1016/j.heliyon.2024.e27411
IF: 3.776
2024-03-06
Heliyon
Abstract: Non-communicable diseases, such as cardiovascular disease, cancer, chronic respiratory diseases, and diabetes, are responsible for approximately 71% of all deaths worldwide. Stroke, a cerebrovascular disorder, is one of the leading contributors to this burden and ranks among the top three causes of death. Early recognition of symptoms can encourage a balanced lifestyle and provide essential information for stroke prediction. Machine learning (ML) is a key tool for physicians in identifying stroke patients and risk factors. However, because features are measured on different scales and carry different probability distributional assumptions, ML-based algorithms struggle to detect risk factors. Furthermore, when risk factors involve high-dimensional features, learning algorithms struggle with complexity. In this study, rigorous statistical tests are used to identify risk factors, and PCA-FA (Integration of Principal Components and Factors) and FPCA (Factor Based PCA) approaches are proposed for projecting suitable feature representations that improve the performance of learning algorithms. The study dataset contains 5110 patient records with clinical, lifestyle, and genetic attributes, allowing for a comprehensive analysis of potential risk factors associated with stroke. At a significance level of P < 0.05, chi-square and independent-sample t-tests identified age, heart_disease, hypertension, work_type, ever_married, bmi, and smoking_status as risk factors for stroke. Among the predictive models built with the proposed feature extraction techniques, random forest gives the best results when combined with the PCA-FA method, reaching an accuracy of 92.55% and an AUC of 98.15%. Compared with existing work, prediction accuracy improves by between 2.19% and 19.03%. Additionally, the prediction results are robust and reproducible with a stacking ensemble-based classification algorithm.
We also developed a web-based application, based on the findings of this study, that doctors can use as an additional tool when assessing stroke risk.
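The workflow summarized in the abstract (statistical screening of risk factors, projection-based feature extraction, then a random forest) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not specify how PCA-FA combines components and factors, so the code assumes a simple concatenation of PCA components and factor-analysis scores via scikit-learn's `FeatureUnion`. The data are synthetic stand-ins, and the helper names `t_test_p` and `chi_square_p`, the column `other`, and all parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data (NOT the study dataset). The label depends on
# `age` and `hypertension`; `bmi`, `glucose`, and `other` are unrelated to it.
age = rng.normal(50, 15, n)
hypertension = rng.integers(0, 2, n)
bmi = rng.normal(28, 5, n)
glucose = rng.normal(100, 20, n)
other = rng.normal(0, 1, n)
logit = 0.08 * (age - 50) + 1.5 * hypertension - 1.0
stroke = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)


def t_test_p(feature, label):
    """P-value of an independent-sample t-test of a numeric feature between classes."""
    return ttest_ind(feature[label == 1], feature[label == 0], equal_var=False).pvalue


def chi_square_p(feature, label):
    """P-value of a chi-square independence test between a categorical feature and the label."""
    table = np.array([[np.sum((feature == a) & (label == b)) for b in (0, 1)]
                      for a in np.unique(feature)])
    return chi2_contingency(table)[1]


# Step 1: screen candidate risk factors at the 0.05 level.
print("age p-value:", t_test_p(age, stroke))
print("hypertension p-value:", chi_square_p(hypertension, stroke))

# Step 2 (assumption): concatenate PCA components and factor scores.
X = np.column_stack([age, hypertension, bmi, glucose, other])
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("fa", FactorAnalysis(n_components=2, random_state=0)),
])

# Step 3: random forest trained on the projected features.
model = make_pipeline(
    StandardScaler(),
    features,
    RandomForestClassifier(n_estimators=200, random_state=0),
)
X_train, X_test, y_train, y_test = train_test_split(
    X, stroke, test_size=0.25, random_state=0, stratify=stroke
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("held-out accuracy:", accuracy)
```

Scaling before projection addresses the differing measurement scales the abstract mentions. The accuracy printed on synthetic data says nothing about the figures reported in the paper.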