Push-based Network-efficient Hadoop YARN Scheduling Mechanism for In-memory Computing

Rong Gu, Kaixuan Huang, Zhixiang Zhang, Chunfeng Yuan, Yihua Huang
DOI: https://doi.org/10.1109/icpads47876.2019.00026
2019-01-01
Abstract: In the big data era, data-intensive cluster computing systems such as Hadoop have gained much popularity, and YARN, introduced with the second generation of Hadoop, has become the general resource manager in the Hadoop ecosystem. In distributed computing scenarios, data locality (scheduling tasks on the nodes where their data resides) is essential to performance, since higher data locality brings lower network transmission cost and higher throughput. However, we find that the native YARN scheduling mechanism achieves little data locality, and that the delay scheduling strategy, while achieving data locality, leads to a long-tail effect in in-memory computing scenarios. Therefore, in this paper we propose a push-based YARN scheduling mechanism for the in-memory computing environment. First, we classify the Resource Requests into various categories. Then, we prune the non-local Resource Requests to achieve fast data-local in-memory computation. Finally, we push the remaining long-tail Resource Requests to the data-locality nodes to avoid the long-tail effect. The experimental results demonstrate that the proposed scheduling mechanism achieves a data-locality percentage of nearly 100%, compared with the native YARN scheduling mechanism, which achieves only 10% to 20%. Under an identical data-locality percentage, the proposed push-based scheduling mechanism improves throughput by nearly 20% and reduces application running time by nearly 10% compared with the existing delay scheduling mechanism used in YARN.
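The three steps named in the abstract (classify, prune, push) can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not name its request categories, so the sketch assumes YARN's standard locality levels (node-local, rack-local, off-switch), and the `ResourceRequest` class, the `classify`, `prune_non_local`, and `push` functions, and the slot-counting model are all hypothetical names introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    # Hypothetical, simplified request: one entry per (task, location) pair.
    task_id: str
    location: str  # a node name, a rack name, or "*" for "anywhere"


def classify(requests, nodes, racks):
    """Step 1: group requests by an assumed locality category."""
    groups = defaultdict(list)
    for r in requests:
        if r.location in nodes:
            groups["node_local"].append(r)
        elif r.location in racks:
            groups["rack_local"].append(r)
        else:
            groups["off_switch"].append(r)
    return groups


def prune_non_local(groups):
    """Step 2: keep only the node-local requests, dropping the rest."""
    return list(groups["node_local"])


def push(requests, free_slots):
    """Step 3: place each task directly on a node that holds its data.

    Instead of waiting for a node to ask for work, the scheduler walks the
    pending node-local requests and assigns each task to the first preferred
    node that still has a free slot.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignments = {}
    for r in requests:
        if r.task_id in assignments:
            continue  # task already placed via another replica location
        if slots.get(r.location, 0) > 0:
            slots[r.location] -= 1
            assignments[r.task_id] = r.location
    return assignments
```

In this toy model, a task whose data is replicated on several nodes produces several node-local requests, and `push` places the task on whichever of those nodes has capacity, so every placed task is data-local by construction.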