Discovery of MicroDependencies

Jizhou Sun, Jianzhong Li
DOI: https://doi.org/10.1109/access.2019.2910843
IF: 3.9
2019-01-01
IEEE Access
Abstract: Data quality rules are a class of frequently used tools for improving data quality; examples include functional dependencies, conditional functional dependencies, and editing rules. Dependencies with stronger expressivity can help to detect more data errors and impute more missing values. To the best of our knowledge, most existing rules treat each attribute value as an inseparable whole. In many applications, however, a part of a value contains useful information, and more powerful rules can thus be formed to handle data quality problems. In this paper, we aim to discover rules of this type, namely microDependencies. The left-hand side of a microDependency is a star-free regular expression, along with the positions of the partial information. This indicates that if a string-type attribute value matches the regular expression, the elements at the specified positions determine another attribute’s value. To discover microDependencies, strings with similar forms should be clustered together. Moreover, similar strings should be aligned vertically so that elements with a similar meaning are shifted to the same position. Then, microDependencies can be discovered directly or by existing methods. Both the clustering and aligning tasks are challenging and play key roles in discovering microDependencies. A greedy bottom-up framework is proposed to perform clustering and alignment simultaneously. For efficiency, several pruning strategies are proposed to reduce the running time. In the experimental study, the performance of our methods is verified on both real and synthetic data sets.
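To make the rule form described in the abstract concrete, the following is a minimal sketch of checking one such rule, not the authors' discovery algorithm. It assumes a hypothetical two-column table of (phone, city) rows and a hand-written pattern whose captured area code is taken to determine the city. The pattern, the column names, the sample rows, and the function `find_violations` are all illustrative assumptions that do not come from the paper.

```python
import re
from collections import defaultdict

# Hypothetical left-hand side: a pattern without Kleene star (only bounded
# repetition) whose single capture group marks the informative positions.
PATTERN = re.compile(r"(\d{3})-\d{7}")


def find_violations(rows, pattern=PATTERN):
    """Return the (phone, city) rows that break the rule
    'the captured part of phone determines city'."""
    cities_by_key = defaultdict(set)
    matched = []
    for phone, city in rows:
        m = pattern.fullmatch(phone)
        if m is None:
            continue  # the rule only constrains values that match the pattern
        key = m.group(1)
        cities_by_key[key].add(city)
        matched.append((key, phone, city))
    # A key violates the rule when it co-occurs with more than one city.
    return [(phone, city) for key, phone, city in matched
            if len(cities_by_key[key]) > 1]


# Illustrative rows only, not data from the paper.
rows = [
    ("010-1234567", "Beijing"),
    ("010-7654321", "Beijing"),
    ("021-1112222", "Shanghai"),
    ("021-3334444", "Beijing"),   # conflicts with the row above
    ("n/a", "Harbin"),            # does not match, so it is ignored
]
violations = find_violations(rows)
print(violations)
```

In this sketch the pattern is given up front. The paper's contribution is discovering such patterns and positions automatically through clustering and alignment, which this example does not attempt.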