Machine Learning Based Missing Values Imputation in Categorical Datasets

Muhammad Ishaq,Sana Zahir,Laila Iftikhar,Mohammad Farhad Bulbul,Seungmin Rho,Mi Young Lee
DOI: https://doi.org/10.1109/ACCESS.2024.3411817
2024-09-12
Abstract:In order to predict and fill in the gaps in categorical datasets, this research looked into the use of machine learning algorithms. The emphasis was on ensemble models constructed using the Error Correction Output Codes framework, including models based on SVM and KNN as well as a hybrid classifier that combines models based on SVM, KNN,and MLP. Three diverse datasets, the CPU, Hypothyroid, and Breast Cancer datasets were employed to validate these algorithms. Results indicated that these machine learning techniques provided substantial performance in predicting and completing missing data, with the effectiveness varying based on the specific dataset and missing data pattern. Compared to solo models, ensemble models that made use of the ECOC framework significantly improved prediction accuracy and robustness. Deep learning for missing data imputation has obstacles despite these encouraging results, including the requirement for large amounts of labeled data and the possibility of overfitting. Subsequent research endeavors ought to evaluate the feasibility and efficacy of deep learning algorithms in the context of the imputation of missing data.
Machine Learning
What problem does this paper attempt to address?
The paper aims to address the problem of predicting and imputing missing data in classification datasets. Specifically, the focus of the study is on utilizing machine learning algorithms, particularly ensemble models based on the Error Correction Output Codes (ECOC) framework (including models based on SVM and support vector machines, K-nearest neighbors [KNN], and hybrid multilayer perceptrons [MLP]) to improve the accuracy and robustness of missing data prediction. By validating the effectiveness of these algorithms on three different datasets (CPU, hypothyroidism, and breast cancer datasets), the results show that these machine learning techniques exhibit significant performance in predicting and imputing missing data, although the effectiveness varies depending on the specific dataset and missing data patterns. Furthermore, compared to single models, the ensemble models using the ECOC framework significantly improved prediction accuracy and robustness. However, despite the encouraging results, challenges remain in using deep learning for missing data imputation, such as the need for large amounts of labeled data and the risk of overfitting. Therefore, future research should evaluate the feasibility and effectiveness of deep learning algorithms in missing data imputation.