Efficient Relaxed Functional Dependency Discovery with Minimal Set Cover

Xiaoou Ding, Yida Liu, Hongzhi Wang, Chen Wang, Yichen Song, Donghua Yang, Jianmin Wang
DOI: https://doi.org/10.1109/icde60146.2024.00271
2024-01-01
Abstract: Assessing data quality through Functional Dependencies (FDs) is a crucial aspect of data governance. However, with the diverse range of data sources and the exponential growth in data volume, exact FDs can be impractical for real-world applications. In contrast, relaxed functional dependencies (RFDs), which allow some tolerance in attribute value comparisons, adapt better to big data scenarios. To discover RFDs efficiently, this paper proposes a novel mining method that fills gaps in current research. By establishing a difference table for tuples, we transform the problem into a specialized minimal set covering problem. Additionally, we introduce two optimization strategies: reducing the time complexity of enumerating the left-hand side (LHS) of the base RFDs to O(1), and decreasing the search complexity for feasible LHS attributes and threshold candidates from O(2^{m-1}) to O(1.5^{m-1}). We rigorously prove that our mining approach is guaranteed to identify valid and minimal RFDs. Experiments on nine real-world datasets show that our method significantly improves efficiency compared to existing techniques. Furthermore, it uncovers more concise and higher-quality RFDs. Importantly, the RFDs extracted through our methodology perform better in downstream cleaning tasks.
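As a rough illustration of the difference-table idea mentioned in the abstract, the following Python sketch builds per-pair attribute differences and checks one candidate RFD through the covering view: every tuple pair that is too far apart on the right-hand side must be "covered" by at least one LHS attribute whose difference exceeds its threshold. This is not the paper's algorithm. The function names, the numeric absolute-difference measure, and the toy data are all assumptions made for the example.

```python
from itertools import combinations


def difference_table(rows, attrs):
    """Return one dict per tuple pair mapping attribute -> absolute difference.

    Assumes numeric attributes; a real system would plug in a
    per-attribute distance function instead.
    """
    return [
        {a: abs(r1[a] - r2[a]) for a in attrs}
        for r1, r2 in combinations(rows, 2)
    ]


def rfd_holds(diffs, lhs, rhs):
    """Check the RFD lhs -> rhs against a difference table.

    lhs: dict attribute -> threshold (hypothetical representation)
    rhs: (attribute, threshold)
    The RFD holds iff every pair that violates the RHS threshold is
    covered by some LHS attribute whose difference exceeds its threshold.
    """
    rhs_attr, rhs_thr = rhs
    violating = [d for d in diffs if d[rhs_attr] > rhs_thr]
    return all(any(d[a] > t for a, t in lhs.items()) for d in violating)


# Toy data, invented for illustration only.
rows = [
    {"age": 30, "years": 5, "salary": 50},
    {"age": 31, "years": 5, "salary": 52},
    {"age": 45, "years": 20, "salary": 90},
    {"age": 30, "years": 6, "salary": 70},
]
diffs = difference_table(rows, ["age", "years", "salary"])

years_ok = rfd_holds(diffs, {"years": 0}, ("salary", 5))
age_ok = rfd_holds(diffs, {"age": 1}, ("salary", 5))
```

In this toy table, `years_ok` is true because the only pair with identical `years` also has salaries within 5, while `age_ok` is false because two tuples with the same `age` differ in salary by 20. Searching for minimal sets of LHS attributes and thresholds that cover all violating pairs is where the set-cover formulation comes in; that search is not shown here.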