FuncEvalGMN: Evaluating Functional Correctness of SQL via Graph Matching Network

Yi Zhan, Yang Sun, Han Weng, Longjie Cui, Guifeng Wang, Jiajun Xie, Yu Tian, Xiaoming Yin, Boyi Liu, Dongchi Huang
2024-07-09
Abstract: In this paper, we propose a novel graph-based methodology to evaluate the functional correctness of SQL generation. Conventional metrics for assessing SQL code generation, such as matching-based and execution-based methods (e.g., exact set match and execution accuracy), are subject to two primary limitations. First, the former fails to effectively assess functional correctness, as different SQL queries may possess identical functionality. Second, the latter is susceptible to producing false positive samples in evaluations. Our proposed evaluation method, FuncEvalGMN, does not depend on extensive preparation of test data, and it enables precise testing of the functional correctness of the code. We first parse SQL into a relational operator tree (ROT) called RelNode, which contains rich semantic information from the perspective of logical execution. We then introduce a GNN-based approach for predicting the functional correctness of generated SQL. This approach incorporates global positional embeddings to address the loss of topological information in conventional graph matching frameworks. As an auxiliary contribution, we propose a rule-based matching algorithm, RelNode Partial Matching (RelPM), as a baseline. Finally, we contribute a dataset, Pair-Aug-Spider, with a training set and two testing sets, each comprising pairs of SQL codes to simulate various SQL code evaluation scenarios. The training set and one testing set focus on code generation using large language models (LLMs), while the other testing set emphasizes SQL equivalence rewriting.
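The two limitations named in the abstract can be reproduced on a toy database. The following is a minimal sketch, not taken from the paper: the `singer` schema, the rows, the three queries, and the helper names `make_db` and `same_result` are all hypothetical, and a plain string comparison stands in for a matching-based metric.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory database holding a hypothetical `singer` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    conn.executemany("INSERT INTO singer VALUES (?, ?, ?)", rows)
    return conn

def same_result(conn, sql_a, sql_b):
    """Execution-style check: do both queries return the same rows?"""
    rows_a = sorted(conn.execute(sql_a).fetchall())
    rows_b = sorted(conn.execute(sql_b).fetchall())
    return rows_a == rows_b

REFERENCE = "SELECT name FROM singer WHERE age > 30"
EQUIVALENT = "SELECT name FROM singer WHERE 30 < age"  # same function, different text
WRONG = "SELECT name FROM singer WHERE age > 20"       # different function

# Sparse test data: no singer is aged between 21 and 30.
sparse = make_db([(1, "Ann", 35), (2, "Bo", 41)])
# Richer test data: one extra row separates the two thresholds.
richer = make_db([(1, "Ann", 35), (2, "Bo", 41), (3, "Cy", 25)])

# Text comparison rejects a functionally equivalent query.
text_match = REFERENCE == EQUIVALENT
# Execution comparison accepts a wrong query when the data is too sparse.
false_positive = same_result(sparse, REFERENCE, WRONG)
# The same wrong query is caught once a distinguishing row exists.
caught = not same_result(richer, REFERENCE, WRONG)
# The equivalent query still agrees with the reference on the richer data.
equivalent_agrees = same_result(richer, REFERENCE, EQUIVALENT)

print("text match on equivalent query:", text_match)           # False
print("wrong query passes on sparse data:", false_positive)    # True
print("wrong query caught on richer data:", caught)            # True
print("equivalent query agrees on richer data:", equivalent_agrees)  # True
```

The execution check is only as strong as the rows it runs on, which is the dependence on test data that the abstract says FuncEvalGMN avoids.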
Databases, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is that existing evaluation methods for SQL generation do not reliably measure functional correctness. Specifically:

1. **Matching-based methods**:
   - Metrics such as BLEU rely mainly on the n-gram overlap between the generated code and the reference code. They capture only basic, lexical-level features and cannot account for functionally equivalent program variations.
   - As a result, they cannot effectively evaluate functional correctness, because different SQL queries may have the same functionality.
2. **Execution-based methods**:
   - Execution Accuracy compares the execution results of the predicted SQL and the reference SQL on a database. It is prone to false positives, especially when the test data is insufficient.
   - These methods require preparing a large amount of test data and place high demands on the test environment and computing resources.

To solve these problems, the authors propose a new graph-based evaluation method, **FuncEvalGMN**. Its main goal is to provide a more accurate and reliable way to evaluate the functional correctness of generated SQL. The specific steps include:

- **Parsing SQL into a relational operator tree (ROT)**: the tree, called RelNode, carries rich semantic information about logical execution.
- **Introducing a graph neural network (GNN)-based method**: the model predicts the functional correctness of the generated SQL. It uses global positional embeddings to address the loss of topological information in conventional graph matching frameworks.
- **Constructing an auxiliary rule-based matching algorithm**: RelNode Partial Matching (RelPM), which serves as a baseline.
- **Contributing a new dataset, Pair-Aug-Spider**: it contains a training set and two test sets, each consisting of SQL code pairs that simulate various SQL code evaluation scenarios.

Through these contributions, FuncEvalGMN can test the functional correctness of generated SQL without extensive test data preparation, addressing the deficiencies of existing methods.
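To make the idea of comparing queries as operator trees more concrete, here is a toy sketch of a rule-based partial match. It is not the paper's RelPM, whose matching rules are not described in this summary. The `Node` class, the `signatures` and `partial_match` helpers, and the F1-style score are all assumptions made for illustration, and the trees are written out by hand rather than produced by a real SQL planner.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """One operator in a hand-built, hypothetical relational operator tree."""
    op: str
    detail: str = ""
    children: tuple = ()

def signatures(node):
    """Count one (op, detail) signature for every node in the tree."""
    counts = Counter([(node.op, node.detail)])
    for child in node.children:
        counts += signatures(child)
    return counts

def partial_match(pred, ref):
    """F1-style overlap between the node signatures of two trees (illustrative only)."""
    pred_sigs, ref_sigs = signatures(pred), signatures(ref)
    overlap = sum((pred_sigs & ref_sigs).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_sigs.values())
    recall = overlap / sum(ref_sigs.values())
    return 2 * precision * recall / (precision + recall)

def plan(threshold):
    """Project(name) over Filter(age > threshold) over Scan(singer)."""
    scan = Node("Scan", "singer")
    filt = Node("Filter", f"age > {threshold}", (scan,))
    return Node("Project", "name", (filt,))

reference = plan(30)
# Assumption: a planner would normalise `30 < age` to the same filter node.
same_plan = plan(30)
# A query with the wrong threshold differs in exactly one of three nodes.
wrong_plan = plan(20)

print(partial_match(reference, same_plan))   # 1.0
print(partial_match(reference, wrong_plan))  # about 0.67
```

This bag-of-nodes score ignores where each node sits in the tree, so it is only a rough stand-in for a structural comparison.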