PreQR: Pre-training Representation for SQL Understanding

Xiu Tang, Sai Wu, Mingli Song, Shanshan Ying, Feifei Li, Gang Chen
DOI: https://doi.org/10.1145/3514221.3517878
2022-01-01
Abstract: Recently, learning-based models have been shown to outperform conventional methods on many database tasks, such as cardinality estimation, join order selection, and performance tuning. However, most existing learning-based methods adopt one-hot encoding for SQL query representation, which cannot capture complex semantic context such as the structure of the query, the database schema definition, and the distribution variance of columns. To address this problem, we propose a novel pre-trained SQL representation model, called PreQR, which extends the language representation approach to SQL queries. We propose an automaton to encode query structures and apply a graph neural network to encode database schema information conditioned on the query. A new SQL encoder is then built using the attention mechanism to support on-the-fly, query-aware schema linking. Experimental results on real datasets show that replacing one-hot encoding with our query representation can significantly improve the performance of existing learning-based models on several database tasks.
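The limitation of one-hot encoding noted in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the three column names and the foreign-key relationship between them are hypothetical, chosen only to show that any two distinct one-hot vectors are orthogonal, so the encoding by itself says nothing about how schema elements relate.

```python
# Hypothetical toy schema (an assumption for illustration, not from the paper):
# orders.customer_id is a foreign key referencing customers.id.
COLUMNS = ["orders.customer_id", "customers.id", "orders.amount"]


def one_hot(name, vocab=COLUMNS):
    """Return a vector with a single 1 at the position of `name` in `vocab`."""
    return [1 if item == name else 0 for item in vocab]


def dot(u, v):
    """Plain dot product, used here as a simple similarity measure."""
    return sum(a * b for a, b in zip(u, v))


fk = one_hot("orders.customer_id")
pk = one_hot("customers.id")
amount = one_hot("orders.amount")

# Similarity between two columns linked by the (assumed) foreign key ...
related = dot(fk, pk)
# ... and between two columns with no such link.
unrelated = dot(fk, amount)

# Both similarities are identical, so the link is invisible in this encoding.
print(related, unrelated)
```

Because every pair of distinct one-hot vectors looks equally dissimilar, a downstream model has to learn schema relationships from scratch; this is the kind of gap a learned, schema-aware representation is meant to fill.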