A MapReduce Cluster Deployment Optimization Framework with Geo-distributed Data.

Shanshan Li, Qinghua Lu, Weishan Zhang, Liming Zhu
DOI: https://doi.org/10.1109/uic-atc-scalcom-cbdcom-iop.2015.179
2015-01-01
Abstract: Big data processing has become a common business need in government and enterprise applications, e.g., analysis or detection of climate change, economic development, or online customer behavior. Hadoop, which implements the MapReduce programming paradigm, is the most mature open-source big data processing framework. Large volumes of source data are stored in HDFS, which Hadoop supports, and processed in parallel on the computing nodes of a cluster. In many cases, however, the source data is distributed across multiple data centers (geo-distributed). Existing deployment research focuses only on moving all data to one data center for processing. This approach is often limited by the size of the input data and the network transmission capacity between data centers, with a severe impact on the performance of big data processing. In this paper, we deal with geo-distributed data sets, analyze the possible cluster deployment options, and select the optimal one with the proposed cluster deployment optimization framework. We introduce the decision-making algorithm that the optimization framework relies on to determine an optimized cluster deployment. In addition, we demonstrate the benefit of our optimization framework over the common deployment for geo-distributed data through experiments on Amazon EC2. The results show that the decision-making algorithm is accurate and that the optimization framework can significantly improve geo-distributed data processing performance by recommending an optimized cluster deployment.
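The abstract does not spell out the decision-making algorithm, so the following is only an illustrative sketch of the general idea of comparing candidate deployments by estimated cost. It is not the authors' method. The cost model (parallel transfers bounded by the slowest link, followed by processing at a fixed rate) and every name in it (`Site`, `centralize_cost`, `best_centralized_target`, the bandwidth and throughput figures) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """A data center holding part of the input (hypothetical model)."""
    name: str
    data_gb: float          # input data stored at this site
    uplink_gb_per_s: float  # assumed outbound bandwidth to any other site


def centralize_cost(sites, target, process_gb_per_s):
    """Estimated seconds to copy all remote data to `target`, then process it there."""
    # Assumption: transfers from different sites run in parallel,
    # so the slowest transfer determines the total transfer time.
    transfer = max(
        (s.data_gb / s.uplink_gb_per_s for s in sites if s.name != target),
        default=0.0,
    )
    total_gb = sum(s.data_gb for s in sites)
    return transfer + total_gb / process_gb_per_s


def best_centralized_target(sites, process_gb_per_s):
    """Return the name of the site with the lowest estimated total cost."""
    return min(
        sites, key=lambda s: centralize_cost(sites, s.name, process_gb_per_s)
    ).name


# Made-up example figures, not measurements from the paper.
sites = [
    Site("us-east", 100.0, 0.1),
    Site("eu-west", 20.0, 0.1),
    Site("ap-south", 10.0, 0.05),
]
best = best_centralized_target(sites, process_gb_per_s=1.0)
```

Under this simplified model, the site that already holds most of the data tends to win, because the largest transfer is avoided. A fuller comparison would also estimate deployments that keep the data in place and span the cluster across data centers.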