Abstract: Missing data are common in medical datasets, particularly those concerning diabetes. Discovering valuable patterns through medical data analysis is crucial for early and precise medical prediction, but this depends heavily on data quality and on the proper handling of missing values. Imputation is a robust approach to this challenge. This paper investigates the effects of Machine Learning (ML) based imputation techniques, specifically K Nearest Neighbor Imputation (KNNI), Multiple Imputation by Chained Equations (MICE), and MissForest. The results of all three techniques are compared with the complete dataset at five missing rates (10% to 50%) and evaluated using four categories of criteria: (1) model performance; (2) imputation error (Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Coefficient of Determination (R^2)); (3) Pearson correlation analysis; and (4) model selection criteria (Bayesian information criterion (BIC) and Akaike information criterion (AIC)). Model performance comprises the accuracy, precision, recall, F1 score, and Matthews Correlation Coefficient (Mcoff) of four ML classifiers: (a) Random Forest (RF), (b) Support Vector Machine (SVM), (c) AdaBoost, and (d) XGBoost (XGB). Across the missing-rate cases, MissForest outperforms KNNI and MICE in accuracy and Mcoff in 80% of cases, in precision in 40% of cases, in recall in 60% of cases, in F1 score, MAE, RMSE, and R^2 in 100% of cases, in AIC in 80% of cases, and in BIC in 100% of cases. The correlation analysis also confirms that MissForest imputation preserves the associations between variables found in the complete dataset. Overall, our findings suggest that MissForest is a better ML-based imputation technique for handling missing data in diabetes research.
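
The imputation-error part of this evaluation (hide known values at a given missing rate, impute them, then score the imputed cells against the complete data with MAE, RMSE, and R^2) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's code: the data are synthetic rather than a diabetes dataset, the five rates are assumed to be 10%, 20%, 30%, 40%, and 50%, values are hidden completely at random, R^2 is pooled over all hidden cells, and the helper names (`make_data`, `mask_mcar`, `knn_impute`, `mean_impute`, `imputation_errors`) are hypothetical. Only a simple nearest-neighbor imputer and a column-mean baseline are shown; MICE and MissForest are not implemented here.

```python
import math
import random

def make_data(n_rows=200, seed=0):
    """Synthetic, correlated 3-column data (a stand-in, not the paper's dataset)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        x = rng.gauss(0.0, 1.0)
        rows.append([x, 2.0 * x + rng.gauss(0.0, 0.3), -x + rng.gauss(0.0, 0.3)])
    return rows

def mask_mcar(rows, rate, seed=1):
    """Hide a fraction `rate` of cells at random, keeping at least one value per row."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    rng.shuffle(cells)
    masked = [r[:] for r in rows]
    hidden, target = [], round(rate * len(cells))
    for i, j in cells:
        if len(hidden) == target:
            break
        if sum(v is not None for v in masked[i]) > 1:
            masked[i][j] = None
            hidden.append((i, j))
    return masked, hidden

def column_means(masked):
    """Mean of the observed values in each column."""
    cols = list(zip(*masked))
    return [sum(v for v in c if v is not None) / sum(v is not None for v in c)
            for c in cols]

def mean_impute(masked):
    """Baseline: replace each missing cell with its column mean."""
    means = column_means(masked)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in masked]

def knn_impute(masked, k=5):
    """Fill each missing cell with the average of its k nearest donors' values."""
    means = column_means(masked)
    n_cols = len(masked[0])
    filled = [r[:] for r in masked]
    for i, row in enumerate(masked):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            donors = []
            for other in masked:
                if other[j] is None:
                    continue
                # Mean squared distance over columns observed in both rows.
                shared = [(row[c] - other[c]) ** 2 for c in range(n_cols)
                          if c != j and row[c] is not None and other[c] is not None]
                if shared:
                    donors.append((sum(shared) / len(shared), other[j]))
            donors.sort(key=lambda t: t[0])
            nearest = donors[:k]
            filled[i][j] = (sum(v for _, v in nearest) / len(nearest)
                            if nearest else means[j])
    return filled

def imputation_errors(truth, filled, hidden):
    """MAE, RMSE, and R^2 computed only on the cells that were hidden."""
    y = [truth[i][j] for i, j in hidden]
    p = [filled[i][j] for i, j in hidden]
    n = len(y)
    mae = sum(abs(a - b) for a, b in zip(y, p)) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    y_bar = sum(y) / n
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return mae, math.sqrt(ss_res / n), 1.0 - ss_res / ss_tot

truth = make_data()
results = {}
for rate in [0.1, 0.2, 0.3, 0.4, 0.5]:  # assumed spacing of the five rates
    masked, hidden = mask_mcar(truth, rate)
    for name, impute in [("knn", knn_impute), ("mean", mean_impute)]:
        results[(rate, name)] = imputation_errors(truth, impute(masked), hidden)

for (rate, name), (mae, rmse, r2) in sorted(results.items()):
    print(f"rate={rate:.0%} {name:4s} MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

Scoring only the hidden cells keeps the observed values from inflating the metrics. In a fuller comparison, each imputer under study would be run through the same loop so that all methods see identical masks at each rate.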

Impact of machine learning-based imputation techniques on medical datasets- a comparative analysis
