Abstract: Complex data mining algorithms run in multiple iterations, where the output of one iteration serves as input to the next. Existing parallel programming frameworks, e.g., MapReduce, Pregel, and Spark, adopt a breadth-first search (BFS) strategy to process such iterative jobs: they invoke the user-defined functions on every key-value pair or vertex to produce all possible intermediate results for the next iteration. This BFS strategy incurs high I/O overhead because the size of the intermediate results is normally exponential in the size of the original data, making it impossible to keep them in memory. In this paper, we present a new type of parallel programming model, the stack-centric model, in which all computations are defined over a stack maintained in distributed shared memory. The stack can be adaptively split into multiple stacks and disseminated to different compute nodes for parallel processing. The most distinctive feature of the stack-centric model is its support for depth-first search (DFS) algorithms, which incur much less memory overhead than their BFS counterparts. The maximal memory usage of a DFS algorithm is determined by the height of its search tree; hence, it is possible to run a DFS algorithm mostly in memory. Our stack-centric model is not a pure DFS framework: it also supports hybrid BFS and DFS algorithms by tuning the trade-off between memory usage and parallelism. To show the advantages of the stack-centric model, we implement two algorithms, frequent pattern mining and DNA sequence matching, on both the stack-centric model and Spark. The stack-centric model uses 10 times less memory than Spark, resulting in a significant performance improvement.
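The memory argument in the abstract can be illustrated with a small single-process sketch. It is not the paper's implementation: the toy search tree (all subsets of n items, as in itemset enumeration) and the names `children`, `bfs_peak`, `dfs_peak`, and `split_stack` are assumptions made for this example. The sketch compares the peak number of pending nodes under level-by-level BFS with that of an explicit-stack DFS, and shows one simple way a stack could be split into independent stacks.

```python
def children(node, n):
    """Extend an itemset (a tuple of item indices) with every larger index."""
    start = node[-1] + 1 if node else 0
    return [node + (i,) for i in range(start, n)]


def bfs_peak(n):
    """Expand the tree level by level; return (nodes visited, peak frontier size)."""
    frontier, visited, peak = [()], 0, 1
    while frontier:
        peak = max(peak, len(frontier))
        visited += len(frontier)
        # The whole next level is materialized before any of it is processed.
        frontier = [c for node in frontier for c in children(node, n)]
    return visited, peak


def dfs_peak(n, stack=None):
    """Process an explicit stack depth first; return (nodes visited, peak stack size)."""
    stack = [()] if stack is None else list(stack)
    visited, peak = 0, len(stack)
    while stack:
        peak = max(peak, len(stack))
        node = stack.pop()
        visited += 1
        stack.extend(children(node, n))
    return visited, peak


def split_stack(stack, parts):
    """Hypothetical splitting policy: deal pending entries round-robin into `parts` stacks."""
    return [stack[i::parts] for i in range(parts)]


if __name__ == "__main__":
    n = 10
    print("BFS (visited, peak):", bfs_peak(n))
    print("DFS (visited, peak):", dfs_peak(n))
    # Expand the root once, then hand its children to three independent workers.
    for i, sub in enumerate(split_stack(children((), n), 3)):
        print("worker", i, "(visited, peak):", dfs_peak(n, sub))
```

In this toy tree both traversals visit the same nodes, but the BFS frontier grows with the widest level while the DFS stack stays bounded by the number of items. Splitting the stack trades some of that memory saving for parallelism, since each sub-stack can be processed without coordination.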

A Stack-Centric Processing Model for Iterative Processing
