Efficient Shuffle Management for DAG Computing Frameworks Based on the FRQ Model

Rui Ren, Chunghsuan Wu, Zhouwang Fu, Tao Song, Yanqiang Liu, Zhengwei Qi, Haibing Guan
DOI: https://doi.org/10.1016/j.jpdc.2020.11.008
IF: 4.542
2021-01-01
Journal of Parallel and Distributed Computing
Abstract: In large-scale data-parallel analytics, shuffle, namely the cross-network read and aggregation of partitioned data between tasks with data dependencies, usually incurs large overhead. To reduce shuffle overhead, we present SCache, an open-source plug-in system that focuses specifically on shuffle optimization. SCache adopts heuristic pre-scheduling combined with shuffle size prediction to pre-fetch shuffle data and balance load on each node. Meanwhile, SCache takes full advantage of system memory to accelerate the shuffle process. We also propose a new performance model, the Framework Resources Quantification (FRQ) model, to analyze DAG frameworks and evaluate the SCache shuffle optimization. The FRQ model quantifies the utilization of resources and predicts the execution time of each phase of DAG jobs. We have implemented SCache on both Spark and Hadoop MapReduce. The performance of SCache has been evaluated with both simulations and testbed experiments on a 50-node Amazon EC2 cluster. These evaluations demonstrate that, by incorporating SCache, the shuffle overhead of Spark can be reduced by nearly 89%, and the overall completion time of TPC-DS queries improves by 40% on average. On Apache Hadoop MapReduce, SCache improves end-to-end Terasort completion time by 15%.