Comparison of Performance of Data Imputation Methods for Numeric Dataset
Anil Jadhav, Dhanya Pramod, Krishnan Ramanathan
DOI: https://doi.org/10.1080/08839514.2019.1637138
IF: 2.777
2019-07-04
Applied Artificial Intelligence
Abstract: Missing data is a common problem faced by researchers and data scientists, and it must be handled appropriately to obtain accurate results from data analysis. The objective of this paper is to provide a better understanding of data missingness mechanisms and data imputation methods, and to assess the performance of widely used data imputation methods for numeric datasets. It is intended to help practitioners and data scientists select an appropriate imputation method for numeric datasets when performing data mining tasks. We comprehensively compare seven data imputation methods: mean imputation, median imputation, kNN imputation, predictive mean matching, Bayesian linear regression (norm), non-Bayesian linear regression (norm.nob), and random sample. We use five numeric datasets obtained from the UCI Machine Learning Repository to analyze and compare the performance of these methods. Performance is assessed using the normalized Root Mean Square Error (RMSE). The results show that the kNN imputation method outperforms the other methods. We also find that the performance of the imputation methods is independent of the dataset and of the percentage of missing values in the dataset.
computer science, artificial intelligence; engineering, electrical & electronic
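The evaluation protocol summarized in the abstract (hide known values, impute them, and score the imputations against the hidden truth) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mask_values`, `impute_column_stat`, `impute_knn`, `nrmse`), the synthetic data, the kNN distance (root mean squared difference over jointly observed columns), and the normalization of RMSE by the range of the hidden true values are all assumptions made for illustration. Only three of the seven methods (mean, median, kNN) are sketched.

```python
import math
import random
import statistics


def mask_values(rows, fraction, seed=0):
    """Copy `rows`, set about `fraction` of the cells to None, and return the masked positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = rng.sample(cells, int(round(fraction * len(cells))))
    out = [list(r) for r in rows]
    for i, j in masked:
        out[i][j] = None
    return out, masked


def impute_column_stat(rows, stat):
    """Fill each missing cell with a statistic (e.g. mean or median) of its column's observed values."""
    fills = [stat([v for v in col if v is not None]) for col in zip(*rows)]
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def impute_knn(rows, k=3):
    """Fill each missing cell with the average of that column over the k nearest rows.

    Assumed distance: root mean squared difference over columns observed in both rows.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    fallback = impute_column_stat(rows, statistics.mean)  # used when no comparable donor exists
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            donors = [d for d in donors if d[0] < math.inf][:k]
            if donors:
                out[i][j] = sum(val for _, val in donors) / len(donors)
            else:
                out[i][j] = fallback[i][j]
    return out


def nrmse(truth, imputed, masked):
    """RMSE over the masked cells, divided by the range of the true masked values (assumed normalization)."""
    actual = [truth[i][j] for i, j in masked]
    guess = [imputed[i][j] for i, j in masked]
    rmse = math.sqrt(sum((a - g) ** 2 for a, g in zip(actual, guess)) / len(actual))
    spread = max(actual) - min(actual)
    return rmse / spread if spread else rmse


# Synthetic, correlated numeric data (hypothetical; not one of the UCI datasets).
rng = random.Random(42)
data = []
for _ in range(60):
    x = rng.uniform(0, 10)
    data.append([x, 2 * x + rng.gauss(0, 0.5), 10 - x + rng.gauss(0, 0.5)])

incomplete, masked = mask_values(data, fraction=0.10, seed=1)
imputed = {
    "mean": impute_column_stat(incomplete, statistics.mean),
    "median": impute_column_stat(incomplete, statistics.median),
    "kNN": impute_knn(incomplete, k=3),
}
scores = {name: nrmse(data, filled, masked) for name, filled in imputed.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Repeating the run with different values of `fraction` and different datasets would mirror the kind of comparison the abstract describes. The scores from this toy data say nothing about the paper's reported results.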