A Comparative Study of Data Skew in Hadoop

Majun He, Guozhong Li, Chaojie Huang, Yufei Ye, Wenhong Tian
DOI: https://doi.org/10.1145/3171592.3171610
2017-01-01
Abstract: MapReduce, a well-known programming model, processes large volumes of raw data on large-scale clusters. However, routinely handling big data poses great challenges to the MapReduce programming model. One of the key challenges is reducing the processing time of the cluster by minimizing the makespan. Data skew is partly responsible for long makespans, and research teams have proposed methods that address it from different perspectives. To fully understand and make use of the state of the art on the data skew problem, this paper compares six algorithms: Hadoop default (speculative execution), SkewReduce, SkewTune, iShuffle, LEEN and LIBRA. They are compared in terms of architecture and main features, core algorithms, performance metrics and evaluation methods. Finally, a few challenging problems are summarized as future research trends.
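The link between data skew and makespan can be illustrated with a toy calculation. The sketch below is not taken from the paper and does not model any of the six compared systems; it assumes a hypothetical modulo partitioner, integer key IDs, and a reducer cost proportional to the number of records it receives. Under those assumptions, the makespan of the reduce phase is the load of the busiest reducer.

```python
def reducer_loads(key_counts, num_reducers):
    """Sum records per reducer under a simple modulo partitioner (an assumption)."""
    loads = [0] * num_reducers
    for key_id, count in key_counts.items():
        loads[key_id % num_reducers] += count
    return loads


def makespan(loads):
    """Reduce-phase makespan, assuming cost is proportional to record count."""
    return max(loads)


def skew_ratio(loads):
    """Busiest reducer's load divided by the mean load (1.0 means perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical inputs: 8 keys and 800 records in both cases.
uniform = {k: 100 for k in range(8)}
skewed = {0: 660, **{k: 20 for k in range(1, 8)}}

uniform_loads = reducer_loads(uniform, 4)
skewed_loads = reducer_loads(skewed, 4)

print(uniform_loads, makespan(uniform_loads), skew_ratio(uniform_loads))
print(skewed_loads, makespan(skewed_loads), skew_ratio(skewed_loads))
```

With the same total input, the single hot key leaves three reducers nearly idle while one reducer determines the finish time, which is the imbalance that skew-handling methods try to reduce.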