Research on Query Analysis and Optimization Based on Spark

Yan Li, Hongbo Wang, Yangyang Li
DOI: https://doi.org/10.1109/iccsnt.2017.8343697
2017-01-01
Abstract: With the rapid development of the Internet and the explosive growth of information, traditional technical frameworks cannot meet the needs of massive data processing. In this environment, big data platforms emerged. Compared to the Hadoop MapReduce programming model, the Spark computing framework is more widely applicable because it introduces the RDD (Resilient Distributed Dataset) and a memory-based computing model. Spark SQL is an API that integrates relational processing with Spark's functional programming. It provides a better choice for handling massive structured data. However, for multi-table join queries, the most complex and costly operations in traditional querying, Spark SQL's performance is poor. To some extent, this has limited the application of Spark. This paper first introduces the technical background of the Spark architecture and the Catalyst optimizer, and then explains the factors that cause low query performance. Then, a design scheme of cost optimization and predicate pushdown is proposed based on Spark SQL. The proposed scheme builds on the extensible Catalyst optimizer and mitigates the performance degradation caused by improper selection of the table join algorithm and by the triggering of shuffles. Finally, a Spark cluster test environment is built to verify the feasibility and performance improvement of the proposed scheme.
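The two ideas named in the abstract, cost-based selection of a join algorithm and predicate pushdown, can be illustrated with a small plain-Python sketch. This is not the authors' scheme and does not use Spark: the function names, the sample rows, and the size-only cost rule are assumptions made for illustration. The 10 MB threshold mirrors the default value of Spark's `spark.sql.autoBroadcastJoinThreshold` setting.

```python
from collections import defaultdict

# Assumed threshold, mirroring Spark's default autoBroadcastJoinThreshold (10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(left_bytes, right_bytes,
                         threshold=BROADCAST_THRESHOLD_BYTES):
    """Toy cost rule: broadcast the smaller table if it fits under the
    threshold (no shuffle needed), otherwise fall back to a sort-merge join."""
    if min(left_bytes, right_bytes) <= threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"


def hash_join(left, right, key):
    """Inner equi-join of two lists of dicts on `key`."""
    index = defaultdict(list)
    for row in right:                      # build side
        index[row[key]].append(row)
    # probe side: merge each left row with every matching right row
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def join_then_filter(left, right, key, predicate):
    """Naive plan: join everything, then filter. Returns (rows, rows joined)."""
    joined = hash_join(left, right, key)
    return [row for row in joined if predicate(row)], len(left)


def filter_then_join(left, right, key, predicate):
    """Pushed-down plan: filter the left input first, then join.
    Valid here because the predicate only reads left-side columns."""
    kept = [row for row in left if predicate(row)]
    return hash_join(kept, right, key), len(kept)


# Hypothetical sample data.
orders = [
    {"cust_id": 1, "amount": 50},
    {"cust_id": 2, "amount": 500},
    {"cust_id": 1, "amount": 900},
]
customers = [{"cust_id": 1, "name": "a"}, {"cust_id": 2, "name": "b"}]


def is_large(row):
    return row["amount"] > 100


naive_rows, naive_input = join_then_filter(orders, customers, "cust_id", is_large)
pushed_rows, pushed_input = filter_then_join(orders, customers, "cust_id", is_large)
```

Both plans return the same rows, but the pushed-down plan feeds fewer rows into the join. In Spark itself, the plan the optimizer actually chose can be inspected with `DataFrame.explain()`.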