GDS: General Distributed Strategy for Functional Dependency Discovery Algorithms
Peizhong Wu, Wei Yang, Haichuan Wang, Liusheng Huang
DOI: https://doi.org/10.1007/978-3-030-59410-7_17
2020-01-01
Abstract: Functional dependencies (FDs) are important metadata that describe relationships among the columns of a dataset and can be used in a number of tasks, such as schema normalization and data cleansing. In modern big data environments, data are partitioned, so single-node FD discovery algorithms are inefficient without parallelization. However, existing parallel distributed algorithms incur large communication costs and therefore do not perform well enough. To solve this problem, we propose a general parallel discovery strategy, called GDS, that improves the parallel performance of single-node algorithms. GDS consists of two essential building blocks: the FD-Combine algorithm and the affine plane block design algorithm. The former infers the final FDs from part-FD sets, where a part-FD set is an FD set that holds over part of the original dataset. The latter generates data blocks such that the part-FD sets of those blocks satisfy the FD-Combine induction condition. With our strategy, any single-node FD discovery algorithm can be parallelized directly, without modification, in distributed environments. In the evaluation, with p threads, the speedup of the FD discovery algorithm FastFDs exceeds the square root of p in most cases and even exceeds p/2 in some cases. In distributed environments, the best multi-threaded algorithm, HYFD, also improves significantly with our strategy when the number of threads is large.
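To make the role of the block design concrete, the following is a minimal sketch, not the paper's implementation. It assumes the standard affine plane of prime order q, in which q*q partitions are grouped into q*q + q blocks so that every pair of partitions shares exactly one block. Since an FD violation is always witnessed by a pair of rows, an FD then holds on the whole dataset exactly when it holds on every block. The function names (`affine_plane_blocks`, `pair_coverage`, `fd_holds`, `holds_in_every_block`) and the toy data are hypothetical, and the sketch does not reproduce FD-Combine itself.

```python
from itertools import combinations


def affine_plane_blocks(q):
    """Lines of the affine plane of prime order q (assumed construction).

    Points are numbered 0 .. q*q - 1, with point (x, y) mapped to x*q + y.
    Returns q*q + q blocks, each containing q points.
    """
    blocks = []
    # Non-vertical lines: y = m*x + b (mod q)
    for m in range(q):
        for b in range(q):
            blocks.append(frozenset(x * q + (m * x + b) % q for x in range(q)))
    # Vertical lines: x = c
    for c in range(q):
        blocks.append(frozenset(c * q + y for y in range(q)))
    return blocks


def pair_coverage(blocks, n_points):
    """Count, for every pair of points, how many blocks contain both."""
    counts = {pair: 0 for pair in combinations(range(n_points), 2)}
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] += 1
    return counts


def fd_holds(rows, lhs, rhs):
    """True if the FD lhs -> rhs (lists of column indices) holds on rows."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        val = tuple(row[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


def holds_in_every_block(partitions, blocks, lhs, rhs):
    """Check the FD on each data block (union of the block's partitions)."""
    for block in blocks:
        rows = [row for p in sorted(block) for row in partitions[p]]
        if not fd_holds(rows, lhs, rhs):
            return False
    return True


q = 3
blocks = affine_plane_blocks(q)
coverage = pair_coverage(blocks, q * q)

# Toy data (hypothetical): one row per partition, columns (i % 4, i % 2, i).
partitions = [[(i % 4, i % 2, i)] for i in range(q * q)]
all_rows = [row for part in partitions for row in part]

# Column 0 determines column 1; column 1 does not determine column 0.
print(fd_holds(all_rows, [0], [1]), holds_in_every_block(partitions, blocks, [0], [1]))
print(fd_holds(all_rows, [1], [0]), holds_in_every_block(partitions, blocks, [1], [0]))
```

In this sketch each block holds only q of the q*q partitions, so each worker sees a fraction 1/q of the data, while the pairwise coverage keeps the block-level checks consistent with the global one.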