Uncoupled MapReduce: A Balanced and Efficient Data Transfer Model

Jie Zhang, Maosen Sun, Jian Lin, Li Zha
DOI: https://doi.org/10.1007/978-3-642-40131-2_4
2013-01-01
Abstract: In the MapReduce model, reduce tasks fetch the output data of map tasks in a "pull" manner. However, reduce tasks that occupy reduce slots cannot start computing until all the corresponding map tasks have completed. This creates a dependence between map and reduce tasks, which this paper calls the coupled relationship. The coupled relationship leads to two problems: reduce slot hoarding and underutilized network bandwidth. We propose an uncoupled intermediate data transfer model to address these problems. Three core techniques (weighted mapping, data pushing, and partial data backup) are introduced and implemented in Apache Hadoop, the mainstream open-source implementation of the MapReduce model. This work has been put into practice at Baidu, the largest search engine company in China. A real-world web data processing application shows that our model improves system throughput by 29.5%, reduces total wall time by 22.8%, and provides a weighted wall time acceleration of 26.3%. Moreover, the implementation of this model is transparent to user jobs and compatible with the original Hadoop.