Repairing Data Through Regular Expressions

Zeyu Li, Hongzhi Wang, Wei Shao, Jianzhong Li, Hong Gao
DOI: https://doi.org/10.14778/2876473.2876478
IF: 2.5
2016-01-01
Proceedings of the VLDB Endowment
Abstract: Since regular expressions are often used to detect errors in sequences such as strings or dates, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expressions that makes the input sequence data obey a given regular expression with minimal revision cost. The proposed method contains two steps: sequence repair and token value repair. For sequence repair, we propose the Regular-expression-based Structural Repair (RSR) algorithm. RSR is a dynamic programming algorithm that uses a Nondeterministic Finite Automaton (NFA) to calculate the edit distance between a prefix of the input string and a partial pattern of the regular expression, with time complexity O(nm^2) and space complexity O(mn), where m is the number of edges in the NFA and n is the length of the input string. We also develop an optimization strategy to achieve higher performance on long strings. For token value repair, we combine the edit-distance-based method with association rules, using a unified argument to select the proper method. Experimental results on both real and synthetic data show that the proposed method repairs data effectively and efficiently.
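The sequence repair step amounts to finding the cheapest set of edits that turns the input into a string accepted by the NFA. The following is a minimal sketch of that idea, not the RSR algorithm from the paper: it swaps in a plain shortest-path (Dijkstra) search over pairs (input position, NFA state) instead of the paper's dynamic program. The function name `regex_edit_distance`, the hand-built NFA for the pattern `ab*c`, and the unit costs for insertion, deletion, and substitution are all assumptions made for illustration.

```python
import heapq

def regex_edit_distance(text, transitions, start, accepting):
    """Minimal number of unit-cost edits (insert, delete, substitute)
    needed to make `text` accepted by the NFA.

    transitions: list of (src, symbol, dst); symbol None is an epsilon move.
    Hypothetical helper for illustration, not the paper's RSR algorithm.
    """
    outgoing = {}
    for src, sym, dst in transitions:
        outgoing.setdefault(src, []).append((sym, dst))

    n = len(text)
    best = {(0, start): 0}          # cheapest known cost per (position, state)
    heap = [(0, 0, start)]          # (cost, position, state)
    while heap:
        cost, i, q = heapq.heappop(heap)
        if cost > best.get((i, q), float("inf")):
            continue                # stale heap entry
        if i == n and q in accepting:
            return cost             # first accepting pop is optimal
        moves = []
        if i < n:
            moves.append((cost + 1, i + 1, q))          # delete text[i]
        for sym, dst in outgoing.get(q, []):
            if sym is None:
                moves.append((cost, i, dst))            # epsilon move, free
            else:
                moves.append((cost + 1, i, dst))        # insert sym
                if i < n:
                    # match (cost 0) or substitute (cost 1)
                    moves.append((cost + (text[i] != sym), i + 1, dst))
        for new_cost, ni, nq in moves:
            if new_cost < best.get((ni, nq), float("inf")):
                best[(ni, nq)] = new_cost
                heapq.heappush(heap, (new_cost, ni, nq))
    return None                     # no accepting state is reachable

# Hand-built NFA for the pattern ab*c (states 0, 1, 2; state 2 accepts).
AB_STAR_C = [(0, "a", 1), (1, "b", 1), (1, "c", 2)]

print(regex_edit_distance("abbc", AB_STAR_C, 0, {2}))
print(regex_edit_distance("abbd", AB_STAR_C, 0, {2}))
```

Here the first call reports that `abbc` already matches, and the second that `abbd` needs a single substitution. Recording the predecessor of each (position, state) pair would recover the repaired string itself rather than only its cost.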