Abstract: Functional dependencies (FDs) play an important role in many data management tasks such as schema normalization, data cleaning, and query optimization. Meanwhile, application demand for efficient FD discovery on large-scale datasets keeps growing. Unfortunately, because of their huge runtime and memory overhead, existing single-machine FD discovery algorithms are inefficient on large-scale datasets. Distributed data-parallel computing has recently become the de facto standard for large-scale data processing, but designing an efficient distributed FD discovery algorithm is challenging. In this paper, we present SmartFD, an efficient and scalable algorithm for distributed FD discovery. First, we propose a novel attribute sorting-based algorithm framework. Next, to discover all the FDs grouped by a given attribute, we propose an efficient distributed algorithm, Attribute-centric Functional Dependency Discovery (AFDD). In AFDD, we design a Fast Sampling and Early Aggregation (FSEA) mechanism to improve the efficiency of distributed sampling, and we propose a memory-efficient index-based method for distributed FD validation. Moreover, AFDD employs an attribute-parallel method to accelerate the pruning and generation of candidate FDs. Furthermore, we propose an adaptive strategy for switching between distributed sampling and distributed validation, based on a unified time-based efficiency metric, and we employ a distributed probing-based method to make the switching strategy more accurate. Experimental results on Apache Spark show that SmartFD outperforms the state-of-the-art single-machine algorithm HyFD and the existing distributed algorithm HFDD with speedups of 3.2× to 44.9× and 2.5× to 455.7×, respectively. Moreover, SmartFD achieves good row scalability and column scalability, and it has sub-linear node scalability.
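To make the notion of FD validation concrete, the following is a minimal single-machine sketch of checking whether one candidate FD holds on a small table. It is not SmartFD's or AFDD's method: the function name `holds`, the dictionary-based lookup, and the sample rows are all assumptions introduced here for illustration, and the abstract does not describe the actual index structure.

```python
def holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on rows.

    rows: list of dicts mapping attribute name to value (hypothetical layout).
    lhs:  list of attribute names on the left-hand side.
    rhs:  a single attribute name on the right-hand side.
    """
    seen = {}  # maps each lhs value combination to the first rhs value seen
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = row[rhs]
        # An FD is violated when two rows agree on lhs but differ on rhs.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical sample data, not taken from the paper.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "94105", "city": "San Francisco", "name": "Ann"},
]

print(holds(rows, ["zip"], "city"))   # True: each zip maps to one city
print(holds(rows, ["name"], "city"))  # False: "Ann" appears with two cities
```

A discovery algorithm has to run checks like this over a very large space of candidate FDs, which is why the abstract emphasizes sampling, pruning, and distributed validation.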

DGST: Efficient and Scalable Suffix Tree Construction on Distributed Data-Parallel Platforms

ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings

Using GPU to Accelerate Suffix Array Construction

Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms

Unordered Task-Parallel Augmented Merge Tree Construction

A Stack-Centric Processing Model for Iterative Processing

A grid-aided and STR-Tree-based algorithm for partitioning vector data

DSA: Scalable Distributed Sequence Alignment System Using SIMD Instructions

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

SparkDQ: Efficient Generic Big Data Quality Management on Distributed Data-Parallel Computation

Parallel and private generalized suffix tree construction and query on genomic data

Scalable and Efficient Construction of Suffix Array with MapReduce and In-Memory Data Store System

Parallel Strong Connectivity Based on Faster Reachability

Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering

Parallel and distributed architecture of genetic algorithm on Apache Hadoop and Spark

Distributed Gene Clinical Decision Support System Based on Cloud Computing

Massively Parallel SPMD Algorithm for Cluster Computing — Combining Genetic Algorithm with Uphill

Exploiting Scalable Parallelism for Remote Sensing Analysis Models by Data Transformation Graph

Exploiting Parallelism for Bioinformatics Data Analysis Applications by Data Transformation Graph

Optimal Parallel Algorithms for Dendrogram Computation and Single-Linkage Clustering