A Learned Cost Model for Big Data Query Processing

Yan Li, Liwei Wang, Sheng Wang, Yuan Sun, Bolong Zheng, Zhiyong Peng
DOI: https://doi.org/10.1016/j.ins.2024.120650
2022-01-01
Abstract: In the Spark SQL big data processing engine, the efficiency of query processing depends heavily on the execution plan and the allocated resources. However, the cost models for Spark SQL are still based on hand-crafted rules. Learning-based cost models have been proposed for relational databases, but they do not consider the effect of the available resources. To address this, we propose a resource-aware deep learning model that automatically predicts the execution time of query plans from historical data. To train our model, we embed each query execution plan based on its query plan tree and extract features from the allocated resources. A deep learning model with adaptive attention mechanisms is then trained to predict the execution time of query plans. Experiments show that our deep cost model predicts the execution time of query plans more accurately than traditional rule-based methods and learning-based optimizers designed for relational databases.
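To make the input side of such a model concrete, the following is a minimal sketch of how a query plan tree and a resource allocation could be turned into one feature vector. It is not the authors' implementation: the operator vocabulary, the `PlanNode` class, the resource keys (`executors`, `cores_per_executor`, `memory_gb`), and the fixed `query` vector standing in for learned attention parameters are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field

# Hypothetical operator vocabulary; the abstract does not specify the encoding.
OPERATORS = ["Scan", "Filter", "Project", "Join", "Aggregate", "Sort"]


@dataclass
class PlanNode:
    """A node of a query plan tree (illustrative structure, not Spark's own)."""
    op: str
    est_rows: float
    children: list = field(default_factory=list)


def encode_node(node):
    """One-hot operator type followed by a log-scaled row estimate."""
    one_hot = [1.0 if node.op == name else 0.0 for name in OPERATORS]
    return one_hot + [math.log1p(node.est_rows)]


def flatten(node):
    """Pre-order traversal that yields one feature vector per plan node."""
    vectors = [encode_node(node)]
    for child in node.children:
        vectors.extend(flatten(child))
    return vectors


def attention_pool(vectors, query):
    """Softmax-weighted sum of node vectors.

    `query` is a stand-in for parameters a real model would learn.
    """
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def featurize(plan, resources, query):
    """Concatenate the pooled plan embedding with the resource features."""
    pooled = attention_pool(flatten(plan), query)
    resource_vec = [
        float(resources["executors"]),
        float(resources["cores_per_executor"]),
        float(resources["memory_gb"]),
    ]
    return pooled + resource_vec


# Example plan: Aggregate over a Join of a Scan and a filtered Scan.
plan = PlanNode("Aggregate", 10, [
    PlanNode("Join", 1000, [
        PlanNode("Scan", 5000),
        PlanNode("Filter", 200, [PlanNode("Scan", 8000)]),
    ]),
])
resources = {"executors": 4, "cores_per_executor": 2, "memory_gb": 8}
query = [0.0] * (len(OPERATORS) + 1)  # all-zero query gives uniform attention
features = featurize(plan, resources, query)
```

In a trained model the resulting vector would be passed to a regressor that outputs the predicted execution time. Flattening the tree discards its parent-child structure, so this sketch only shows how plan and resource features can share one input.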