Data Cleaning and Machine Learning: A Systematic Literature Review

Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed, Dmytro Humeniuk, Foutse Khomh

2024-05-31

Abstract: Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning, hence creating a dual relationship between ML and data cleaning. To the best of our knowledge, there is no study that comprehensively reviews this relationship. Objective: This paper's objectives are twofold. First, it aims to summarize the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides future work recommendations. Method: We conduct a systematic literature review of the papers published between 2016 and 2022 inclusively. We identify different types of data cleaning activities with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. Results: We summarize the content of 101 papers covering various data cleaning activities and provide 24 future work recommendations. Our review highlights many promising data cleaning techniques that can be further extended. Conclusion: We believe that our review of the literature will help the community develop better approaches to clean data.

Machine Learning, Databases

What problem does this paper attempt to address?

This paper addresses the lack of a comprehensive review of the two-way relationship between data cleaning and machine learning (ML). It has two objectives: to summarize the latest approaches in data cleaning for machine learning (DC4ML) and machine learning for data cleaning (ML4DC), and to provide recommendations for future work. To that end, the authors conduct a systematic literature review (SLR) of 101 papers published between 2016 and 2022, covering six data cleaning activities: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. The paper notes that as ML spreads across industries, data quality matters more and more for model performance. Existing research has explored how ML can improve the efficiency and accuracy of data cleaning, but no comprehensive review of the field existed. The review is therefore meant to give the community a baseline understanding of current data cleaning techniques and to encourage the development of better ones. The authors offer 24 recommendations for future research and point to promising data cleaning techniques that can be extended further. The paper also stresses the role of data cleaning in improving the performance of ML systems, particularly given the data-intensive trend in artificial intelligence, in which the quality of the data is becoming more important than the models themselves.
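To make two of the listed activities concrete, here is a minimal Python sketch of imputation and outlier detection on a single numeric column. It is an illustration only and is not taken from the paper or from any of the reviewed approaches: the function names `impute_median` and `iqr_outliers`, the sample values, and the choice of median filling and the 1.5 × IQR rule are all assumptions made for this example.

```python
from statistics import median, quantiles

def impute_median(values):
    """Imputation: replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Outlier detection: return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical column with one missing value and one suspicious reading.
raw = [10.0, 12.0, None, 11.0, 13.0, 250.0, 12.0]
filled = impute_median(raw)      # the None at index 2 is filled in
flagged = iqr_outliers(filled)   # indices of values flagged as outliers
print(filled)
print(flagged)
```

The order of the two steps is itself a design choice: here the median is computed before the outlier is flagged, which works because the median is robust to a single extreme value, whereas a mean-based fill would be pulled toward it.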

Machine Learning and Data Cleaning: Which Serves the Other?

CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making

A systematic literature review of machine learning in online personal health data

A systematic literature review on Machine Learning Model evaluation on healthcare applications

Distance-based Data Cleaning: A Survey (Technical Report)

Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework

Overview and Importance of Data Quality for Machine Learning Tasks

Maintainability Challenges in ML: A Systematic Literature Review

A Hybrid Data Cleaning Framework Using Markov Logic Networks

A Primer on the Data Cleaning Pipeline

Human-Centric Data Cleaning [Vision]

Data Quality Antipatterns for Software Analytics

Use of machine learning in geriatric clinical care for chronic diseases: a systematic literature review

Systematic Literature Review on Application of Learning-based Approaches in Continuous Integration

A Systematic Mapping Study on Testing of Machine Learning Programs

Machine Learning Approaches for Fake Reviews Detection: A Systematic Literature Review