Performance Evaluation And Optimization Of Join Operation In Spark For Big Data Processing

Deyang Qiu, Wenli Zhou, Jun Liu
DOI: https://doi.org/10.1109/compcomm.2017.8322944
2017-01-01
Abstract: Spark is now the most important distributed computing framework for big data. Join operations appear frequently in Spark programs, for example in the PageRank algorithm. Because a join triggers a large amount of Shuffle, it takes much of the running time in big data processing. In this paper, we optimize both each iteration and the whole application of the PageRank algorithm in three ways: broadcast, cache, and combined broadcast-cache. Experimental data show that broadcast and cache reduce Shuffle effectively and thus significantly decrease the running time of a job.
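To illustrate why broadcasting can reduce Shuffle in a join, the following is a minimal sketch in plain Python, not Spark code and not the authors' implementation. It models partitions as lists of (key, value) pairs and counts how many records have to move to a different partition. The function names, the modulo partitioner, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

def partition_of(key, num_partitions):
    # Hypothetical stand-in for a hash partitioner (keys are ints here).
    return key % num_partitions

def shuffle_join(left_parts, right_parts):
    """Repartition both sides by key, then join inside each partition.
    Returns (joined rows, records moved to a different partition)."""
    n = len(left_parts)
    moved = 0
    left_buckets = [defaultdict(list) for _ in range(n)]
    right_buckets = [defaultdict(list) for _ in range(n)]
    for parts, buckets in ((left_parts, left_buckets), (right_parts, right_buckets)):
        for src, part in enumerate(parts):
            for key, value in part:
                dst = partition_of(key, n)
                if dst != src:
                    moved += 1  # this record would cross the network
                buckets[dst][key].append(value)
    joined = []
    for p in range(n):
        for key, left_values in left_buckets[p].items():
            for lv in left_values:
                for rv in right_buckets[p].get(key, []):
                    joined.append((key, (lv, rv)))
    return joined, moved

def broadcast_join(large_parts, small_parts):
    """Copy the small side to every partition and join in place.
    No record of the large side leaves its partition, so moved is 0;
    the cost is one copy of the small table per partition instead."""
    lookup = defaultdict(list)
    for part in small_parts:
        for key, value in part:
            lookup[key].append(value)
    joined = []
    for part in large_parts:
        for key, lv in part:
            for rv in lookup.get(key, []):
                joined.append((key, (lv, rv)))
    return joined, 0

# Toy data: two partitions per side (assumed, not from the paper).
large = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
small = [[(1, 0.25)], [(2, 0.75)]]

shuffle_rows, shuffle_moved = shuffle_join(large, small)
broadcast_rows, broadcast_moved = broadcast_join(large, small)
print(sorted(shuffle_rows) == sorted(broadcast_rows), shuffle_moved, broadcast_moved)
```

Both functions produce the same joined rows; only the amount of data movement differs. The sketch only pays off when the broadcast side is small enough to fit in memory on every worker, which is the usual precondition for a broadcast join. Caching is a separate technique (keeping a reused dataset in memory across iterations) and is not modelled here.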