Distribution preserving train-test split directed ensemble classifier for heart disease prediction

Debasis Mohapatra, Sourav Kumar Bhoi, Chittaranjan Mallick, Kalyan Kumar Jena, Satrujit Mishra
DOI: https://doi.org/10.1007/s41870-022-00868-2
2022-01-21
International Journal of Information Technology
Abstract: Every year, worldwide health records report an enormous number of deaths due to heart disease. Advances in healthcare systems have addressed this problem to some extent, but the severity of heart disease persists in society. In the recent past, considerable effort has gone into applying computational techniques such as machine learning to handle this issue effectively. Several research articles report the use of machine learning for early prediction of heart disease from clinical attributes obtained through clinical investigations and tests. Specifically, supervised machine learning approaches build a model from available datasets collected from patients’ health records, in which each patient's heart disease status is known, and the model can then predict whether a person is suffering from heart disease. Along the same lines, we apply several standard classifiers to the heart disease dataset collected from the UCI Machine Learning Repository. Unlike existing proposals, we propose a distribution-preserving train-test split and then apply the classifiers to it. We also consider ensemble classifiers for this purpose. The results show that the Naïve Bayes Classifier (NB-C) performs best among all individual classifiers under consideration according to Accuracy, Precision, Recall, and F1-score. We also prepare an ensemble (ALN-C) of the three best individual classifiers obtained from the evaluation, i.e., the Artificial Neural Network Classifier (ANN-C), the Logistic Regression Classifier (LR-C), and the Naïve Bayes Classifier (NB-C), and compare it with two existing ensemble methods: AdaBoost and Random Forest. With the proposed distribution-preserving train-test split, the ALN-C ensemble outperforms AdaBoost and Random Forest according to Accuracy and F1-score.
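The abstract does not spell out how the distribution-preserving split or the ALN-C combination rule is implemented. As a rough illustration only, the sketch below assumes the simplest reading: the split preserves the class-label proportions (stratification), and the ensemble combines its members by majority vote. The function names `stratified_split` and `majority_vote` and the toy labels are hypothetical and do not come from the paper.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class keeps roughly
    the same share in the train and test parts (assumed interpretation)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # randomize within each class
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def majority_vote(predictions):
    """Combine several classifiers' label lists by per-sample majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy data: 60 negative and 40 positive cases (illustrative, not the UCI dataset).
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)

# Hypothetical outputs of three classifiers on three samples.
combined = majority_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]])
```

In this toy run the test part keeps the 40% positive share of the full data, which is the property a label-stratified split is meant to guarantee. The paper's actual procedure may preserve more than the label distribution.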