RTClean: Context-aware Tabular Data Cleaning using Real-time OFDs

Daniel Del Gaudio, Tim Schubert, Mohamed Abdelaal
DOI: https://doi.org/10.48550/arXiv.2302.04726
2023-02-10
Abstract: Machine learning now plays a key role in many applications, e.g., smart homes, smart medical assistance, and autonomous driving. A major challenge for these applications is maintaining high quality of the training and serving data. However, existing data cleaning methods cannot exploit context information and therefore usually fail to track shifts in the data distributions or the associated error profiles. To overcome these limitations, we introduce a novel method for automated tabular data cleaning powered by dynamic functional dependency rules extracted from a live context model. As a proof of concept, we create a smart home use case to collect data while preserving the context information. Using two different data sets, our evaluations show that the proposed cleaning method outperforms a set of baseline methods in terms of detection and repair accuracy.
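To make the idea of context-dependent functional dependency rules concrete, the following is a minimal sketch and not the RTClean implementation. It assumes a rule is a pair (lhs, rhs) meaning "the value of column lhs determines the value of column rhs", that a hypothetical context lookup (`rules_for_context`) decides which rules are currently active, and that violations are repaired by a simple majority vote, which is a stand-in technique chosen here for illustration. All function names, column names, and the sample data are assumptions.

```python
from collections import Counter, defaultdict


def find_fd_violations(rows, lhs, rhs):
    """Map row index -> suggested repair for rows violating the rule lhs -> rhs.

    A row counts as a violation if its rhs value differs from the most
    frequent rhs value among rows sharing the same lhs value
    (majority vote; an illustrative assumption, not the paper's method).
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[lhs]].append(i)

    violations = {}
    for idxs in groups.values():
        counts = Counter(rows[i][rhs] for i in idxs)
        majority, _ = counts.most_common(1)[0]
        for i in idxs:
            if rows[i][rhs] != majority:
                violations[i] = majority
    return violations


def rules_for_context(context):
    """Hypothetical stand-in for a live context model.

    Returns the (lhs, rhs) rules assumed to hold in the given context.
    """
    if context.get("occupied"):
        return [("room", "light")]
    return []


def clean(rows, active_rules):
    """Return a repaired copy of rows; the input list is left untouched."""
    repaired = [dict(r) for r in rows]
    for lhs, rhs in active_rules:
        for i, value in find_fd_violations(repaired, lhs, rhs).items():
            repaired[i][rhs] = value
    return repaired


# Made-up sample readings for illustration only.
rows = [
    {"room": "kitchen", "light": "on"},
    {"room": "kitchen", "light": "on"},
    {"room": "kitchen", "light": "off"},
    {"room": "bedroom", "light": "off"},
]

cleaned = clean(rows, rules_for_context({"occupied": True}))
```

In this sketch the same table is cleaned differently depending on the context: with the rule active, the third kitchen reading is flagged and repaired, while an empty rule set leaves the data unchanged. Majority voting is only one of many possible repair strategies and breaks down when errors dominate a group.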
Databases