An Optimal Locality-Aware Task Scheduling Algorithm Based on Bipartite Graph Modelling for Spark Applications

Zhongming Fu, Zhuo Tang, Li Yang, Chubo Liu
DOI: https://doi.org/10.1109/tpds.2020.2992073
IF: 5.3
2020-10-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: In the distributed computing framework of Spark, cross-node/rack data transfers produced by map tasks and reduce tasks are a common cause of performance degradation, such as prolonged overall execution time and network congestion. To address these problems, this article uses bipartite graph modelling to propose an optimal locality-aware task scheduling algorithm. By considering global optimality, the algorithm can generate the optimal scheduling solution for both the map tasks and the reduce tasks with respect to data locality. Because the two kinds of tasks have different communication modes, this article uses a unified graph to model the map task scheduling and the reduce task scheduling respectively. Then, by calculating the communication cost matrix of tasks, we formulate an optimal task scheduling scheme that minimizes overall communication cost and transform the problem into a well-known graph problem, minimum weighted bipartite matching (MWBM), which can be solved by the Kuhn-Munkres algorithm. In addition, this article proposes a locality-aware executor allocation strategy to improve data locality further. We implement our algorithm and strategy in Spark-2.4.1 and evaluate their performance using several representative micro-benchmarks, macro-benchmarks, and the HiBench benchmark suite. The experimental results verify that, by reducing network traffic and access latency, the proposed algorithm can improve job performance substantially compared to some other task scheduling algorithms.
Computer Science, Theory & Methods; Engineering, Electrical & Electronic
What problem does this paper attempt to address?
This paper addresses the cross-node/rack data transfers generated by map tasks and reduce tasks in the distributed computing framework Spark. These transfers can degrade performance, for example by increasing overall execution time and causing network congestion. To reduce them, the paper proposes an optimal locality-aware task scheduling algorithm based on bipartite graph modelling. By considering global optimality, the algorithm can generate optimal scheduling solutions for both map tasks and reduce tasks with respect to data locality. Because the two kinds of tasks have different communication patterns, the paper uses a unified graph to model map task scheduling and reduce task scheduling respectively. Then, by calculating the communication cost matrix of tasks, it formulates an optimal task scheduling scheme that minimizes the overall communication cost and transforms the problem into a well-known graph theory problem, Minimum Weighted Bipartite Matching (MWBM), which can be solved by the Kuhn-Munkres algorithm. In addition, the paper proposes a locality-aware executor allocation strategy to further improve data locality. The experimental results verify that the proposed algorithm can significantly reduce network traffic and access latency, thereby substantially improving job performance.
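The reduction to minimum weighted bipartite matching can be illustrated with a small Python sketch. It is not the paper's implementation: the cost model (0 for a node-local read, 1x the input size within a rack, 2x across racks), the helper names `build_cost_matrix` and `min_cost_assignment`, and the sample cluster are all assumptions made for illustration. Only the overall shape follows the text: build a task-by-executor communication cost matrix, then solve the assignment with the Kuhn-Munkres (Hungarian) algorithm.

```python
from typing import Dict, List, Set, Tuple

INF = float("inf")


def build_cost_matrix(
    tasks: List[Tuple[int, Set[str]]],   # (input size, nodes holding a replica)
    executor_nodes: List[str],           # node hosting each executor
    rack_of: Dict[str, str],             # node -> rack
) -> List[List[int]]:
    """Hypothetical cost model: 0 if node-local, size if rack-local, 2*size otherwise."""
    matrix = []
    for size, replicas in tasks:
        replica_racks = {rack_of[n] for n in replicas}
        row = []
        for node in executor_nodes:
            if node in replicas:
                row.append(0)
            elif rack_of[node] in replica_racks:
                row.append(size)
            else:
                row.append(2 * size)
        matrix.append(row)
    return matrix


def min_cost_assignment(cost: List[List[float]]) -> Tuple[float, List[int]]:
    """Kuhn-Munkres in O(n^2 * m) for n tasks and m executors, requiring n <= m.

    Returns (total cost, assignment) where assignment[i] is the executor
    index chosen for task i.
    """
    n, m = len(cost), len(cost[0])
    assert n <= m, "need at least as many executors as tasks"
    u = [0.0] * (n + 1)      # potentials for tasks (1-indexed)
    v = [0.0] * (m + 1)      # potentials for executors (1-indexed)
    match = [0] * (m + 1)    # match[j] = task matched to executor j, 0 if free
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = match[j0]
            delta, j1 = INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    reduced = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if reduced < minv[j]:
                        minv[j] = reduced
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so at least one new zero-reduced-cost edge appears.
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        # Flip matched edges along the augmenting path back to the root.
        while j0 != 0:
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if match[j] != 0:
            assignment[match[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return total, assignment


# Made-up cluster: three executors on nodes n1, n2 (rack r1) and n3 (rack r2).
sample_tasks = [(4, {"n1"}), (2, {"n1"}), (5, {"n2"})]
sample_cost = build_cost_matrix(
    sample_tasks, ["n1", "n2", "n3"], {"n1": "r1", "n2": "r1", "n3": "r2"}
)
sample_total, sample_assignment = min_cost_assignment(sample_cost)
```

In this made-up example two tasks compete for the executor on `n1`. A scheduler that places each task greedily on its preferred node can push a large input across racks, whereas the matching weighs all tasks together and sends the smallest input over the expensive link. For production use, a library routine such as SciPy's `linear_sum_assignment` solves the same assignment problem.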