Abstract: Complex data mining algorithms run in multiple iterations, where the output of one iteration serves as the input to the next. Existing parallel programming frameworks, e.g., MapReduce, Pregel, and Spark, adopt a breadth-first search (BFS) strategy to process such iterative jobs: they invoke the user-defined functions on every key-value pair or vertex to produce all possible intermediate results for the next iteration. This BFS strategy incurs high I/O overhead because the size of the intermediate results is normally exponential in the size of the original data, making it impossible to keep those results in memory. In this paper, we present a new type of parallel programming model, the stack-centric model, in which all computations are defined over a stack maintained in distributed shared memory. The stack can be adaptively split into multiple stacks that are disseminated to different compute nodes for parallel processing. The most distinctive feature of the stack-centric model is its support for depth-first search (DFS) algorithms, which incur much less memory overhead than their BFS counterparts. The maximal memory usage of a DFS algorithm is determined by the height of its search tree, so most of its computation can be carried out in memory. Our stack-centric model is not a pure DFS framework: it also supports hybrid BFS and DFS algorithms by tuning the trade-off between memory usage and parallelism. To show the advantages of the stack-centric model, we implement two algorithms, frequent pattern mining and DNA sequence matching, on both the stack-centric model and Spark. The stack-centric model uses 10 times less memory than Spark, resulting in a significant performance improvement.
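The memory argument above can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the set-enumeration tree, and the names `children`, `traverse`, and `split_stack`, are hypothetical choices made only for this illustration. The sketch walks the same search tree with a stack (DFS) and with a queue (BFS), records the peak number of pending nodes in each case, and shows one simple way a stack of pending work could be split into independent stacks.

```python
from collections import deque


def children(node, n):
    """Children of a node in a set-enumeration tree over items 0..n-1.

    A node is a tuple of strictly increasing item indices (an itemset).
    This tree shape is an assumption chosen for illustration only.
    """
    start = node[-1] + 1 if node else 0
    return [node + (i,) for i in range(start, n)]


def traverse(n, depth_first):
    """Visit every node once; return (visited_count, peak_frontier_size)."""
    frontier = deque([()])  # pending nodes, starting from the empty itemset
    visited = 0
    peak = 1
    while frontier:
        # DFS takes the newest pending node (stack); BFS takes the oldest (queue).
        node = frontier.pop() if depth_first else frontier.popleft()
        visited += 1
        frontier.extend(children(node, n))
        peak = max(peak, len(frontier))
    return visited, peak


def split_stack(stack, parts):
    """Split a list of pending nodes round-robin into `parts` independent stacks.

    Hypothetical splitting policy; the abstract does not specify how the
    stack-centric model chooses what to split.
    """
    return [stack[i::parts] for i in range(parts)]


if __name__ == "__main__":
    n = 10
    dfs_visited, dfs_peak = traverse(n, depth_first=True)
    bfs_visited, bfs_peak = traverse(n, depth_first=False)
    print("nodes visited:", dfs_visited, bfs_visited)
    print("peak pending nodes (DFS, BFS):", dfs_peak, bfs_peak)
    print(split_stack(children((), 5), 2))
```

Both traversals visit the same nodes, but the BFS frontier must at some point hold an entire level of the tree, whereas the DFS stack only holds the unexplored siblings along the current path. Each stack returned by `split_stack` can be processed by a separate worker, which is the sense in which splitting trades extra memory for parallelism.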

Optimization of parallel FP-Growth algorithm based on Spark

A FP Growth Algorithm Based on MapReduce Model and It's Application

Distributed High-Dimension Matrix Operation Optimization on Spark

Pfp: Parallel Fp-Growth For Query Recommendation

An Implementation of FP-Growth Algorithm Based on High Level Data Structures of Weka-JUNG Framework

A Guided FP-growth algorithm for multitude-targeted mining of big data

A Stack-Centric Processing Model for Iterative Processing

An Improved Algorithm Based on FP-growth

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

Parallel and distributed architecture of genetic algorithm on Apache Hadoop and Spark

Tree Partition Based Parallel Frequent Pattern Mining on Shared Memory Systems

Parallel Frequent Pattern Mining Without Candidate Generation on GPUs.

A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment

BiFuG2-Spark: Bi-directional Fuzzy Granular-Cabin Parallel Attribute Reduction Accelerator with Granular-Group Collaboration

An Improved FP-Growth Algorithm with Time Decay Factor and Element Attention Weight

Study of ELM Algorithm Parallelization Based on Spark

Memory optimization of Spark parallel computing framework

Tuning the granularity of parallelism for distributed graph processing

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

The parallel algorithms for LIBSVM parameter optimization based on Spark