Addressing Missing Data in a Healthcare Dataset Using an Improved kNN Algorithm

Tressy Thomas, Enayat Rajabi
DOI: https://doi.org/10.1007/978-3-030-77977-1_17
2021-01-01
Abstract: Missing values are ubiquitous in many real-world datasets. In scenarios where a dataset is not very large, addressing its missing values with appropriate data imputation methods benefits analysis significantly. In this paper, we applied and evaluated a new imputation approach called k-Nearest Neighbour with Most Significant Features and incomplete cases (KNNI$_\mathrm{MSF}$) to impute missing values in a healthcare dataset. This algorithm combines k-Nearest Neighbour (kNN) and ReliefF feature selection techniques to address incomplete cases in the dataset. The merit of imputation is measured by comparing the classification performance of models trained on the dataset with and without imputation. We used a real-world dataset, "very low birth weight infants", to predict the survival outcome of infants with low birth weights. Five different classifiers were used in the experiments. The comparison of multiple performance metrics shows that classifiers built on the imputed dataset produce much better outcomes. KNNI$_\mathrm{MSF}$ generally outperformed the k-Nearest Neighbour Imputation using Random Forest feature weights (KNNI$_\mathrm{RF}$) algorithm with respect to balanced accuracy and specificity.
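The general idea of kNN imputation restricted to a subset of features can be illustrated with a minimal sketch. This is not the paper's KNNI$_\mathrm{MSF}$ algorithm, whose details the abstract does not give. The function name `knn_impute_selected`, the parameter `feature_idx`, and the toy data are all hypothetical. In particular, `feature_idx` is supplied by hand here and merely stands in for the output of a feature selection step such as ReliefF. The sketch assumes that distances are computed only over the selected features observed in both rows, and that any row with an observed value in the target column may act as a donor, even if that row is incomplete elsewhere.

```python
import numpy as np

def knn_impute_selected(X, feature_idx, k=3):
    """Fill NaNs with the mean of the k nearest donor rows.

    Distances use only the columns in feature_idx (assumed to come from
    some feature selection step; hypothetical, not the paper's method).
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n_rows = X.shape[0]
    for i in range(n_rows):
        missing_cols = np.where(np.isnan(X[i]))[0]
        if missing_cols.size == 0:
            continue
        # Root-mean-square distance over selected features observed in both rows.
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = [f for f in feature_idx
                      if not np.isnan(X[i, f]) and not np.isnan(X[j, f])]
            if shared:
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        for col in missing_cols:
            # Donors need a finite distance and an observed value in this column.
            donors = [j for j in np.argsort(dists)
                      if np.isfinite(dists[j]) and not np.isnan(X[j, col])][:k]
            if donors:
                out[i, col] = X[donors, col].mean()
    return out

# Toy data (made up for illustration): row 1 is missing its last value.
X = np.array([
    [1.0, 10.0, 100.0],
    [1.1, 11.0, np.nan],
    [5.0, 50.0, 500.0],
    [1.2, 12.0, 120.0],
])
filled = knn_impute_selected(X, feature_idx=[0, 1], k=2)
```

With `k=2`, the two nearest rows to row 1 on columns 0 and 1 are rows 0 and 3, so the missing entry is filled with the mean of their values in column 2. A real pipeline would also need to scale features before computing distances and handle categorical columns, which this sketch omits.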