Abstract: Complex data mining algorithms run in multiple iterations, where the output of one iteration serves as the input to the next. Existing parallel programming frameworks, e.g., MapReduce, Pregel, and Spark, adopt a breadth-first search (BFS) strategy to process such iterative jobs: they invoke the user-defined functions on every key-value pair or vertex to produce all possible intermediate results for the next iteration. This BFS strategy incurs high I/O overhead because the size of the intermediate results is normally exponential in the size of the original data, making it impossible to keep those results in memory. In this paper, we present a new type of parallel programming model, the stack-centric model, in which all computations are defined over a stack maintained in distributed shared memory. The stack can be adaptively split into multiple stacks that are disseminated to different compute nodes for parallel processing. The most distinctive feature of the stack-centric model is its support for depth-first search (DFS) algorithms, which incur much less memory overhead than their BFS counterparts. The maximal memory usage of a DFS algorithm is determined by the height of its search tree, so most of its computation can be carried out in memory. Our stack-centric model is not a pure DFS framework: it also supports hybrid BFS and DFS algorithms by tuning the trade-off between memory usage and parallelism. To show the advantages of the stack-centric model, we implement two algorithms, frequent pattern mining and DNA sequence matching, on both the stack-centric model and Spark. The stack-centric model uses 10 times less memory than Spark, resulting in a significant performance improvement.
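
The contrast between the two traversal orders can be illustrated with a minimal single-process sketch of frequent itemset mining. It is not the paper's API or implementation: the names `children`, `support`, `mine`, and `split_stack` are hypothetical, the support count is a naive scan, and the round-robin split is only an assumed stand-in for the adaptive splitting the abstract describes. The same pending-work container is consumed either as a stack (DFS) or as a queue (BFS), and the peak number of pending entries is recorded.

```python
from collections import deque


def children(prefix, items):
    """Extend an itemset prefix with every item that sorts after its last element."""
    start = items.index(prefix[-1]) + 1 if prefix else 0
    return [prefix + (item,) for item in items[start:]]


def support(itemset, transactions):
    """Count the transactions that contain every item of the itemset (naive scan)."""
    return sum(1 for t in transactions if set(itemset) <= t)


def mine(transactions, min_support, depth_first=True):
    """Return the frequent itemsets and the peak number of pending entries.

    depth_first=True pops from the top of the container (stack, DFS);
    depth_first=False pops from the front (queue, BFS).
    """
    items = sorted(set().union(*transactions))
    pending = deque([()])  # start from the empty prefix
    frequent, peak = [], 1
    while pending:
        prefix = pending.pop() if depth_first else pending.popleft()
        for candidate in children(prefix, items):
            if support(candidate, transactions) >= min_support:
                frequent.append(candidate)
                pending.append(candidate)
        peak = max(peak, len(pending))
    return sorted(frequent), peak


def split_stack(stack, parts):
    """Deal pending entries round-robin into `parts` independent stacks.

    Assumption: a simple static split; the paper's adaptive policy is not shown.
    """
    entries = list(stack)
    return [entries[i::parts] for i in range(parts)]
```

Calling `mine` twice on the same transactions, once with `depth_first=True` and once with `depth_first=False`, yields the same itemsets while reporting different peak sizes for the pending container. Each list returned by `split_stack` could then be handed to a separate worker, since every entry is an independent subtree of the search.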