An Incremental Partitioning Strategy for Data Balance on MapReduce
WANG Zhuo, CHEN Qun, LI Zhan-Huai, PAN Wei, YOU Li
DOI: https://doi.org/10.11897/sp.j.1016.2016.00019
2016-01-01
Chinese Journal of Computers
Abstract: As a flexible computation model, MapReduce has been widely used to process large data sets on distributed clusters, for tasks such as log analysis, document clustering, and other forms of data analytics. In Hadoop, the open-source MapReduce platform, the default Hash/Range partition scheme usually results in unbalanced data load in the Reduce phase. Even though Hadoop allows users to define a partition function, it is difficult to achieve balanced data load without detailed information on the data distribution. In this paper, we propose a novel multiple-round approach to balance data load in the Reduce phase. In our proposal, each Mapper produces more fine-grained partitions than there are Reducers and gathers statistics on the sizes of these fine-grained partitions. The JobTracker then selects appropriate fine-grained partitions to allocate to each Reducer before the Reduce() function runs. We introduce a cost model and propose a heuristic assignment algorithm for this task. Finally, we experimentally compare our approach with Closer, which uses a segment partition method, on both synthetic and real datasets. The experimental results show that our method achieves a more balanced data load.
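The abstract does not spell out the cost model or the heuristic, so the following is only a minimal sketch of the general idea: given the measured sizes of fine-grained partitions, assign them to Reducers so that the loads come out roughly even. It assumes a simple largest-first greedy rule (each partition goes to the currently least-loaded Reducer) and a cost equal to the partition's size; the function name `assign_partitions` and its parameters are hypothetical and are not taken from the paper or from Hadoop.

```python
import heapq


def assign_partitions(partition_sizes, num_reducers):
    """Greedy largest-first assignment of fine-grained partitions to reducers.

    partition_sizes: dict mapping partition id -> size (e.g. bytes or records).
    num_reducers: number of reducers to spread the partitions over.
    Returns (assignment, loads), both dicts keyed by reducer index.

    Assumption: cost is simply the partition size; the paper's cost model
    may differ.
    """
    # Min-heap of (current load, reducer index): the least-loaded reducer
    # is always at the top.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}

    # Visit partitions from largest to smallest; ties broken by partition id
    # so the result is deterministic.
    ordered = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for pid, size in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(pid)
        heapq.heappush(heap, (load + size, reducer))

    loads = {
        r: sum(partition_sizes[p] for p in pids)
        for r, pids in assignment.items()
    }
    return assignment, loads


# Six fine-grained partitions spread over two reducers.
sizes = {0: 50, 1: 30, 2: 20, 3: 20, 4: 10, 5: 10}
assignment, loads = assign_partitions(sizes, 2)
print(assignment)
print(loads)
```

Producing more partitions than Reducers is what gives such an assignment step room to work: with exactly one partition per Reducer there is nothing to rebalance, whereas many small partitions can be combined to even out skew.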