Abstract: In this paper, we propose a novel graph-based methodology to evaluate the functional correctness of SQL generation. Conventional metrics for assessing SQL code generation, such as matching-based and execution-based methods (e.g., exact set match and execution accuracy), are subject to two primary limitations. Firstly, the former fails to effectively assess functional correctness, as different SQL queries may possess identical functionalities. Secondly, the latter is susceptible to producing false positive samples in evaluations. Our proposed evaluation method, \texttt{FuncEvalGMN}, does not depend on the sufficient preparation of the test data, and it enables precise testing of the functional correctness of the code. Firstly, we parse SQL using a relational operator tree (ROT) called \textit{Relnode}, which contains rich semantic information from the perspective of logical execution. Then, we introduce a GNN-based approach for predicting the functional correctness of generated SQL. This approach incorporates global positional embeddings to address the limitations with the loss of topological information in conventional graph matching frameworks. As an auxiliary contribution, we propose a rule-based matching algorithm, Relnode Partial Matching (\texttt{RelPM}) as a baseline. Finally, we contribute a dataset, \texttt{Pair-Aug-Spider} with a training set and two testing sets, each comprising pairs of SQL codes to simulate various SQL code evaluation scenarios. The training set and one testing dataset focus on code generation using large language models (LLMs), while the other emphasizes SQL equivalence rewriting.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the inability of existing SQL generation evaluation methods to assess functional correctness. Specifically:

1. **Matching-based methods**:
   - Metrics such as BLEU rely mainly on the n-gram overlap between the generated code and the reference code. They capture only surface-level lexical features and cannot recognize functionally equivalent program variations.
   - As a result, they cannot effectively evaluate functional correctness, because different SQL queries may have the same functionality.
2. **Execution-based methods**:
   - Execution Accuracy, for example, compares the execution results of the predicted SQL and the reference SQL on a database. It is prone to false positives, especially when the test data is insufficient.
   - These methods require preparing a large amount of test data and place high demands on the test environment and computing resources.

To solve these problems, the authors propose a new graph-based evaluation method, **FuncEvalGMN**. Its main goal is to provide a more accurate and reliable way to evaluate the functional correctness of SQL generation. The specific steps include:

- **Parsing SQL into a Relational Operator Tree (ROT)**: called RelNode, which carries rich semantic information about logical execution.
- **Introducing a graph neural network (GNN)-based method**: used to predict the functional correctness of the generated SQL. It uses global positional embeddings to address the loss of topological information in conventional graph matching frameworks.
- **Constructing an auxiliary rule-based matching algorithm**: RelNode Partial Matching (RelPM), which serves as a baseline.
- **Contributing a new dataset, Pair-Aug-Spider**: it contains a training set and two test sets, each consisting of SQL code pairs that simulate various SQL code evaluation scenarios. The training set and one test set focus on code generated by large language models (LLMs), while the other test set emphasizes SQL equivalence rewriting.

With these contributions, FuncEvalGMN can test the functional correctness of generated SQL without preparing a large amount of test data, which addresses the deficiencies of existing methods.
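
The two limitations described above can be reproduced on a toy example. The following is a minimal sketch, not taken from the paper: the `emp` table, the three queries, and the helpers `token_overlap`, `make_db`, and `run` are all hypothetical, and a simple token-set Jaccard score stands in for n-gram metrics such as BLEU. It shows a functionally equivalent query scoring lower on lexical overlap than an incorrect one, and an incorrect query passing an execution check when the test data lacks the distinguishing row.

```python
import sqlite3


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (crude stand-in for n-gram metrics)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def make_db(rows):
    """Build an in-memory SQLite database with a hypothetical emp(name, age) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
    return conn


def run(conn, sql):
    """Execute a query and return its rows in sorted order for comparison."""
    return sorted(conn.execute(sql).fetchall())


reference = "SELECT name FROM emp WHERE age > 30"
# Same functionality, different surface form (table alias).
equivalent = "SELECT e.name FROM emp AS e WHERE e.age > 30"
# Different functionality: also returns employees aged exactly 30.
wrong = "SELECT name FROM emp WHERE age >= 30"

# Matching-based view: the wrong query looks closer to the reference.
print("overlap(reference, equivalent):", token_overlap(reference, equivalent))
print("overlap(reference, wrong):     ", token_overlap(reference, wrong))

# Execution-based view: with no employee aged exactly 30, the wrong query
# produces the same result as the reference (a false positive).
sparse = make_db([("Ann", 25), ("Bob", 41)])
print("sparse data, results match:", run(sparse, reference) == run(sparse, wrong))

# Adding a boundary row exposes the difference.
rich = make_db([("Ann", 25), ("Bob", 41), ("Cy", 30)])
print("rich data, results match:  ", run(rich, reference) == run(rich, wrong))
```

Whether the execution check catches the error here depends entirely on whether the test database happens to contain a row with `age = 30`, which is the kind of data dependence the summary attributes to execution-based evaluation.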

FuncEvalGMN: Evaluating Functional Correctness of SQL via Graph Matching Network
