Phase-Reconfigurable Shuffle Optimization for Hadoop MapReduce
Jihe Wang, Meikang Qiu, Bing Guo, Ziliang Zong
DOI: https://doi.org/10.1109/tcc.2015.2459707
IF: 5.697
2015-01-01
IEEE Transactions on Cloud Computing
Abstract: Hadoop MapReduce is a leading open-source framework for Big Data and a pioneering platform for storing and processing very large volumes of information. However, tuning a MapReduce system is difficult because a large number of parameters constrain its performance. Many of them relate to shuffle, a complicated phase between the map and reduce functions that includes sorting, grouping, and HTTP transfer. During the shuffle phase, a large amount of time is spent on disk I/O because of low data throughput. In this paper, we build a mathematical model to evaluate the computational complexity of different operating orders within the map-side shuffle, so that faster execution can be achieved by reconfiguring the order of sorting and grouping. Furthermore, we define a three-dimensional performance exploration space in which features sampled during the shuffle stage, such as the number of keys, the number of spill files, and the variances of the intermediate results, are collected to support the evaluation of the computational complexity of each operating order. Thus, an optimized reconfiguration of the map-side shuffle architecture can be achieved within Hadoop without inducing extra disk I/O. Compared with the original Hadoop implementation, our reconfigurable architecture achieves up to a 2.37x speedup in completing the map-side shuffle work.
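To make the idea of reconfiguring the operating order concrete, the following Python sketch contrasts two orders that produce the same grouped, key-sorted output: sorting all records and then grouping, versus grouping first and then sorting only the distinct keys. It is an illustration only, not the paper's model or Hadoop's implementation. The function names, the in-memory record layout, and the toy cost formulas (roughly n log n for sorting n records, versus n plus k log k for grouping n records into k keys and sorting the keys) are assumptions introduced here; the model in the paper also accounts for features such as the number of spill files and the variances of the intermediate results.

```python
import math
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_then_group(records):
    """Order A (illustrative): sort every (key, value) pair, then group runs."""
    ordered = sorted(records, key=itemgetter(0))  # stable sort on the key only
    return [(k, [v for _, v in run]) for k, run in groupby(ordered, key=itemgetter(0))]


def group_then_sort(records):
    """Order B (illustrative): hash-group values by key, then sort distinct keys."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return [(k, groups[k]) for k in sorted(groups)]


def choose_order(num_records, num_keys):
    """Pick the cheaper order under a toy cost model (an assumption, not the paper's)."""
    n = max(num_records, 1)
    k = max(num_keys, 1)
    cost_sort_first = n * math.log2(n)        # sort all n records
    cost_group_first = n + k * math.log2(k)   # one grouping pass, then sort k keys
    return "group_then_sort" if cost_group_first < cost_sort_first else "sort_then_group"


if __name__ == "__main__":
    sample = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
    print(sort_then_group(sample))
    print(group_then_sort(sample))
    print(choose_order(num_records=1000, num_keys=10))
```

Under these toy assumptions, grouping first is favored when there are few distinct keys relative to the number of records, while sorting first is favored when nearly every record has its own key. This is the kind of trade-off that sampled shuffle features could be used to evaluate.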