Join Query Optimization Based on MapReduce under Skewed Data

杨青,钱俊彦,尚宏佳,张敬伟,周萍
DOI: https://doi.org/10.3778/j.issn.1673-9418.1604022
2017-01-01
Abstract:MapReduce,a classic distributed computing environment,can improve the performance of join query on large-scale data,but when the join attributes do not follow a uniform distribution,the pure hash strategy in traditional MapReduce will lead to load imbalance over computing nodes,which will reduce the performance of overall task.Aiming at the data skew problem in the join query,this paper studies the join query optimization based on MapReduce computing framework.Firstly,this paper conducts experimental analysis for the improved repartitioning join query algorithm,studies the execution phases of join query based on traditional MapReduce computing framework,and finds the performance bottlenecks of join query on MapReduce computing framework when data do not follow a uniform distribution.Based on the above,this paper designs and implements an improved join query optimization algorithm,which is based on an execution strategy by integrating the combination segmentation method and equilibrium partitioning method.The experimental results show that the proposed optimization method provides a good solution for distributed join query on large-scale skewed datasets,and presents an excellent time performance and scalability.
What problem does this paper attempt to address?