Joint Optimization of MapReduce Scheduling and Network Policy in Hierarchical Data Centers
Donglin Yang,Dazhao Cheng,Wei Rang,Yu Wang
DOI: https://doi.org/10.1109/tcc.2019.2961653
IF: 5.697
2022-01-01
IEEE Transactions on Cloud Computing
Abstract: As large-scale data analytics becomes the norm in various industries, the use of MapReduce frameworks to analyze ever-increasing volumes of data will keep growing. In turn, this trend increases the incentive to move MapReduce into multi-tenant clouds. However, the application performance of MapReduce can be significantly affected by the time-varying network bandwidth in a shared cluster. Although many recent studies improve MapReduce performance by dynamic scheduling to reduce the shuffle traffic, most of them do not consider the impact of the hierarchical network architectures that are widespread in data centers. In this article, we propose and design a hierarchical topology (Hit) aware MapReduce scheduler to minimize overall data traffic cost and hence to reduce job execution time. We first formulate the problem as a Topology Aware Assignment (TAA) optimization problem while considering dynamic computing and communication resources in the cloud with hierarchical network architecture. We further develop a synergistic strategy to solve the TAA problem by using stable matching theory, which respects the preferences of both individual tasks and hosting machines. Finally, we implement the proposed scheduler as a pluggable module on Hadoop YARN and evaluate its performance by testbed experiments and simulations. The testbed experimental results show that Hit-scheduler can improve job completion time by 28 and 11 percent compared to Capacity Scheduler and Probabilistic Network-Aware scheduler, respectively. Our simulations further demonstrate that Hit-scheduler can reduce the traffic cost by up to 38 percent and the average shuffle flow traffic time by 32 percent compared to Capacity Scheduler. In this article, we have extended Hit-scheduler to a decentralized heuristic scheme to perform the policy-aware allocation in data center environments.
Many existing centralized approximation approaches are too complex and infeasible to implement over a data center, which typically includes large numbers of servers, containers, switches, and traffic flows. In the extension, we have designed a decentralized heuristic scheme to perform the Policy-Aware Task (PAT) allocation by using an existing centralized algorithm to approximately maximize the total gained utility. Finally, the simulation-based experimental results show that the proposed PAT policy reduces the communication cost by 33.6 percent compared with the default scheduler in data centers.
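The abstract does not spell out the matching procedure, so the following is only an illustrative sketch of how stable matching between tasks and hosting machines can work in general. It uses textbook many-to-one deferred acceptance, not the paper's TAA algorithm. The function name `stable_assign`, the preference lists, the rank tables, and the slot capacities are all hypothetical and chosen for illustration; in a topology-aware setting the task preferences might, for example, be ordered by estimated traffic cost.

```python
def stable_assign(task_prefs, machine_rank, capacity):
    """Many-to-one deferred acceptance (tasks propose, machines accept or evict).

    task_prefs:   task -> list of machines, most preferred first (hypothetical input)
    machine_rank: machine -> {task: rank}, lower rank = more preferred (hypothetical)
    capacity:     machine -> number of task slots it can host (hypothetical)
    Returns machine -> list of tasks it ends up hosting.
    """
    assigned = {m: [] for m in capacity}
    next_choice = {t: 0 for t in task_prefs}  # index of the next machine to try
    free = list(task_prefs)                   # tasks still looking for a host

    while free:
        task = free.pop()
        prefs = task_prefs[task]
        if next_choice[task] >= len(prefs):
            continue  # task has been rejected everywhere; it stays unassigned
        machine = prefs[next_choice[task]]
        next_choice[task] += 1

        # Tentatively accept, then evict the least preferred task if over capacity.
        assigned[machine].append(task)
        if len(assigned[machine]) > capacity[machine]:
            worst = max(assigned[machine], key=lambda t: machine_rank[machine][t])
            assigned[machine].remove(worst)
            free.append(worst)

    return assigned


# Toy instance: three tasks, machine A with one slot, machine B with two.
task_prefs = {"t1": ["A", "B"], "t2": ["A", "B"], "t3": ["B", "A"]}
machine_rank = {
    "A": {"t1": 1, "t2": 0, "t3": 2},
    "B": {"t1": 0, "t2": 1, "t3": 2},
}
capacity = {"A": 1, "B": 2}

result = stable_assign(task_prefs, machine_rank, capacity)
print(result)
```

In this toy instance both t1 and t2 prefer machine A, which has a single slot and ranks t2 higher, so t1 falls back to B. No task and machine would both prefer each other over their final partners, which is the stability property the matching is meant to guarantee.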
computer science, information systems, theory & methods