A Hybrid Data Cleaning Framework Using Markov Logic Networks

Congcong Ge, Yunjun Gao, Xiaoye Miao, Bin Yao, Haobo Wang
DOI: https://doi.org/10.1109/tkde.2020.3012472
IF: 9.235
2020-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: With the growth of dirty data, data cleaning has become a crux of data analysis. In this paper, we propose a novel hybrid data cleaning framework, termed MLNClean, which is capable of learning instantiated rules to supplement insufficient integrity constraints. MLNClean consists of two steps, i.e., pre-processing and two-stage data cleaning. In the pre-processing step, MLNClean first infers a set of probable instantiated rules according to a Markov logic network (MLN) and then builds a two-layer MLN index to generate multiple data versions and facilitate the cleaning process. In the two-stage data cleaning step, MLNClean first introduces the concept of a reliability score to clean errors within each data version separately, and then eliminates conflicting values among different data versions using a novel concept of a fusion score. Extensive experimental results on both real and synthetic scenarios demonstrate the effectiveness of MLNClean.