Dynamic Memory-Aware Scheduling in Spark Computing Environment

Zhuo Tang, Ailing Zeng, Xuedong Zhang, Li Yang, Kenli Li
DOI: https://doi.org/10.1016/j.jpdc.2020.03.010
IF: 4.542
2020-01-01
Journal of Parallel and Distributed Computing
Abstract: Scheduling plays an important role in improving the performance of big-data parallel processing. Spark is an in-memory parallel computing framework that uses a multi-threaded model for task scheduling. Most Spark task scheduling processes do not take memory into account; instead, they rely on a number of concurrent task threads set by the user, which can limit performance. To overcome this limitation in the Spark-core source code, this paper proposes a dynamic memory-aware task scheduler (DMATS) for Spark, which not only treats memory and network I/O as computational resources but also dynamically adjusts concurrency when scheduling tasks. Specifically, we first analyze the RDD-based Spark execution engine to obtain the amount of data each task processes, and propose an algorithm that estimates the initial adaptive task concurrency from the known task input information and the executor memory. Then, a dynamic adjustment algorithm changes the concurrency dynamically based on feedback information to make the best use of limited memory resources. We implement DMATS in Spark 2.3.4 and evaluate its performance with two typical benchmarks, shuffle-light and shuffle-heavy. The results show that the algorithm not only reduces execution time by 43.64% but also significantly improves resource utilization. Experiments also show that the proposed method has advantages over similar works such as WASP.
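The two steps described in the abstract (an initial concurrency estimate from task input size and executor memory, followed by feedback-driven adjustment) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so this is not the DMATS algorithm itself: the names `ExecutorState`, `initial_concurrency`, and `adjust_concurrency`, the `expansion_factor` parameter, and the 0.6/0.85 memory thresholds are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExecutorState:
    """Hypothetical view of one executor; not a Spark API."""
    memory_bytes: int  # memory assumed available to running tasks
    cores: int         # upper bound on concurrent task threads


def initial_concurrency(executor: ExecutorState,
                        task_input_bytes: int,
                        expansion_factor: float = 2.0) -> int:
    """Estimate how many tasks fit in executor memory at once.

    expansion_factor is an assumed ratio between a task's input size
    and its in-memory footprint; the abstract does not specify one.
    """
    per_task = max(1, int(task_input_bytes * expansion_factor))
    fit = executor.memory_bytes // per_task
    # Never exceed the core count, and always run at least one task.
    return max(1, min(executor.cores, fit))


def adjust_concurrency(current: int,
                       memory_used_fraction: float,
                       executor: ExecutorState,
                       low: float = 0.6,
                       high: float = 0.85) -> int:
    """Feedback step: shrink under memory pressure, grow when memory is idle.

    The low/high thresholds are illustrative assumptions.
    """
    if memory_used_fraction > high:
        return max(1, current - 1)
    if memory_used_fraction < low:
        return min(executor.cores, current + 1)
    return current


if __name__ == "__main__":
    gib = 1024 ** 3
    ex = ExecutorState(memory_bytes=4 * gib, cores=8)
    n = initial_concurrency(ex, task_input_bytes=512 * 1024 ** 2)
    print("initial concurrency:", n)
    print("after high memory use:", adjust_concurrency(n, 0.90, ex))
    print("after low memory use:", adjust_concurrency(n, 0.30, ex))
```

Stepping the concurrency by one per feedback round keeps the sketch simple and avoids oscillation; a real scheduler would likely weigh network I/O as well, as the abstract notes.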