Automatic Data Repair: Are We Ready to Deploy?

Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Shuwei Liang, Jianwei Yin
DOI: https://doi.org/10.14778/3675034.3675051
IF: 2.5
2024-06-01
Proceedings of the VLDB Endowment
Abstract: Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from generative models. The study of repairing erroneous data has gained significant importance. Existing data repair algorithms differ in the information they use and in their problem settings, and they have been tested only in limited scenarios. In this paper, we compare and summarize these algorithms with a taxonomy based on the information that drives them. We conduct a comprehensive evaluation of 12 mainstream data repair algorithms on 12 datasets across various data error rates, error types, and 4 downstream analysis tasks, assessing their error reduction performance with a novel but practical metric. We develop an effective and unified repair optimization strategy that substantially benefits state-of-the-art algorithms. We conclude that data repair is always worthwhile and that clean data does not determine the upper bound of data analysis performance. We provide valuable guidelines, challenges, and promising directions in the data repair domain. We anticipate that this paper will enable researchers and users to better understand and deploy data repair algorithms in practice.
computer science, information systems, theory & methods