IterClean: an Iterative Data Cleaning Framework with Large Language Models

Wei Ni, Kaihang Zhang, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Jianwei Yin
DOI: https://doi.org/10.1145/3674399.3674436
2024-01-01
Abstract: In the era of generative artificial intelligence, the accuracy of data is paramount. Erroneous data often leads to faulty outcomes and economic losses. Previous cleaning methods employ a sequential detect-repair paradigm, leaving over half of the errors unresolved in real-world scenarios. We introduce IterClean, an iterative data cleaning framework leveraging large language models (LLMs). The framework works in two steps: data labeling and iterative data cleaning. With only a few labeled tuples, IterClean runs an iterative cleaning process involving an error detector, an error verifier, and an error repairer, which significantly improves cleaning performance. Extensive experiments across four datasets demonstrate that IterClean achieves an F1 score up to three times higher than the best state-of-the-art approaches while requiring only 5 labeled tuples.
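
The abstract names three components (detector, verifier, repairer) and an iterative loop, but does not say how they interact. The following is a minimal Python sketch of one plausible control flow, not the authors' implementation. Every name in it (`iterative_clean`, `toy_detect`, `toy_verify`, `toy_repair`, `max_rounds`) is hypothetical, and the three components are simple rule-based stand-ins where IterClean would use LLMs.

```python
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, str]          # (row index, column name)
Table = List[Dict[str, str]]    # one dict per tuple


def iterative_clean(
    table: Table,
    detect: Callable[[Table], List[Cell]],
    verify: Callable[[Table, Cell], bool],
    repair: Callable[[Table, Cell], str],
    max_rounds: int = 5,        # assumed stopping bound, not from the paper
) -> Table:
    """Repeat detect -> verify -> repair until no verified errors remain."""
    cleaned = [dict(row) for row in table]  # work on a copy of the input
    for _ in range(max_rounds):
        suspects = detect(cleaned)
        # Keep only the suspected cells that the verifier confirms.
        confirmed = [cell for cell in suspects if verify(cleaned, cell)]
        if not confirmed:
            break
        for row_idx, column in confirmed:
            cleaned[row_idx][column] = repair(cleaned, (row_idx, column))
    return cleaned


# Toy stand-ins for the three components (assumptions for illustration only).
VALID_STATES = {"CA", "NY"}
CITY_TO_STATE = {"Los Angeles": "CA", "New York": "NY"}


def toy_detect(table: Table) -> List[Cell]:
    # Flag any state value outside the known vocabulary.
    return [(i, "state") for i, row in enumerate(table)
            if row["state"] not in VALID_STATES]


def toy_verify(table: Table, cell: Cell) -> bool:
    # Confirm an error only when the city gives enough evidence to fix it.
    row_idx, _ = cell
    return table[row_idx]["city"] in CITY_TO_STATE


def toy_repair(table: Table, cell: Cell) -> str:
    # Derive the state from the city.
    row_idx, _ = cell
    return CITY_TO_STATE[table[row_idx]["city"]]


rows = [
    {"city": "Los Angeles", "state": "CA"},
    {"city": "New York", "state": "N.Y."},
    {"city": "Springfield", "state": "??"},
]
result = iterative_clean(rows, toy_detect, toy_verify, toy_repair)
```

In this toy run the second row is repaired to `NY`, while the third row is flagged by the detector but left unchanged because the verifier cannot confirm it. The loop then stops because no confirmed errors remain.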