Efficient, Scalable and Robust Data Shuffle Service for Distributed MapReduce Computing on Cloud
Rong Gu, Xu Huang, Haipeng Dai, Xiaoyu Geng, Xiaofei Chen, Yihua Huang, Fu Xiao, Guihai Chen
DOI: https://doi.org/10.1109/hpcc-dss-smartcity-dependsys57074.2022.00075
2022-01-01
Abstract: Distributed MapReduce computing frameworks, such as Hadoop, Spark, and Flink, are widely used in domains that face big data challenges. Inside MapReduce, Shuffle is a critical stage that bridges the Map stage and the Reduce stage through data partitioning and network transmission under the hood. As data volumes and cluster sizes grow rapidly, the all-to-all data transfer in Shuffle becomes inefficient and fails frequently, making Shuffle the bottleneck of many big data processing jobs. To improve Shuffle efficiency, we propose a hybrid Shuffle model based on file pre-merging and partition reorganization mechanisms. To address the Shuffle stability issue, we design a novel data transmission model based on a replication mechanism. Finally, to support scalable, high-performance Shuffle, we present an approach for tuning the degree of Shuffle parallelism based on machine learning and search techniques. We have implemented the proposed Shuffle models and strategies in Spark, a widely used MapReduce-based big data platform, for evaluation. Experimental evaluation shows that the proposed methods outperform state-of-the-art systems by 23% on average. As a practical application case, the proposed Shuffle service has been running in the production environment of ByteDance Inc. for over one year with significant performance improvement.
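To make the all-to-all transfer pattern mentioned in the abstract concrete, the following is a minimal sketch of a generic hash-partitioned shuffle. It is not the paper's implementation and does not use Spark's API; the function names (`partition_of`, `map_side_shuffle_write`, `reduce_side_shuffle_read`, `count_block_transfers`) are hypothetical and chosen only for illustration. It assumes that each map task splits its output into one block per reducer and that each reducer fetches its block from every map task.

```python
import zlib
from collections import defaultdict


def partition_of(key: str, num_reducers: int) -> int:
    # Deterministic hash partitioning. crc32 is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side_shuffle_write(records, num_reducers):
    """Split one map task's (key, value) output into per-reducer blocks."""
    blocks = defaultdict(list)
    for key, value in records:
        blocks[partition_of(key, num_reducers)].append((key, value))
    return blocks


def reduce_side_shuffle_read(all_map_outputs, reducer_id):
    """Collect this reducer's block from every map task's output."""
    fetched = []
    for blocks in all_map_outputs:
        fetched.extend(blocks.get(reducer_id, []))
    return fetched


def count_block_transfers(num_mappers, num_reducers):
    # Upper bound on block fetches in the all-to-all pattern:
    # every reducer may contact every mapper.
    return num_mappers * num_reducers
```

Under these assumptions, the number of block fetches grows as the product of map and reduce task counts, which illustrates why the abstract describes the transfer as a scaling bottleneck and why merging many small blocks before transfer can help.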