Performance Evaluation for Distributed Join Based on MapReduce.

Jingwei Zhang, Qing Yang, Hongjia Shang, Huibing Zhang, Yuming Lin, Rui Zhou
DOI: https://doi.org/10.1109/ccbd.2016.065
2016-01-01
Abstract: Inner join is a fundamental and frequent operation in large-scale data analysis. MapReduce is the most widely available framework for large-scale data analysis, and a variety of inner-join algorithms have been proposed to run on it. These algorithms are usually designed for specific scenarios, but inner-join performance can vary widely with data volume, reference ratio, data skew rate, running environment, and other factors. This paper summarizes and implements the well-known join algorithms in a uniform MapReduce environment. Taking into account the number of tables, broadcast cost, data skew, join rate, and related factors, we design and conduct a large number of experiments to compare the time cost of these join algorithms. Based on the experimental results, we analyze and summarize the performance and applicability of these algorithms in different scenarios, providing a reference for improving the performance of large-scale data analysis under different circumstances.