SQL-to-Schema Enhances Schema Linking in Text-to-SQL

Sun Yang,Qiong Su,Zhishuai Li,Ziyue Li,Hangyu Mao,Chenxi Liu,Rui Zhao
2024-05-15
Abstract:In sophisticated existing Text-to-SQL methods exhibit errors in various proportions, including schema-linking errors (incorrect columns, tables, or extra columns), join errors, nested errors, and group-by errors. Consequently, there is a critical need to filter out unnecessary tables and columns, directing the language models attention to relevant tables and columns with schema-linking, to reduce errors during SQL generation. Previous approaches have involved sorting tables and columns based on their relevance to the question, selecting the top-ranked ones for sorting, or directly identifying the necessary tables and columns for SQL generation. However, these methods face challenges such as lengthy model training times, high consumption of expensive GPT-4 tokens in few-shot prompts, or suboptimal performance in schema linking. Therefore, we propose an inventive schema linking method in two steps: Firstly, generate an initial SQL query by utilizing the complete database schema. Subsequently, extract tables and columns from the initial SQL query to create a concise schema. Using CodeLlama-34B, when comparing the schemas obtained by mainstream methods with ours for SQL generation, our schema performs optimally. Leveraging GPT4, our SQL generation method achieved results that are comparable to mainstream Text-to-SQL methods on the Spider dataset.
Databases,Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to solve various error problems existing in the existing Text - to - SQL methods, including schema linking errors (such as incorrect columns, tables or extra columns), join errors, nesting errors and grouping errors. These problems lead to errors in generating SQL queries, especially when dealing with complex queries. Therefore, the paper proposes an innovative schema linking method to improve the accuracy of SQL generation through a two - step method: 1. **Generate the initial SQL query**: Use the complete database schema to generate a preliminary SQL query. 2. **Extract the schema**: Extract tables and columns from the initial SQL query to form a concise schema. The main goal of this method is to reduce unnecessary tables and columns, guide the language model to focus on relevant tables and columns, and thus reduce errors in the SQL generation process. The paper also proposes new evaluation metrics (such as Table Recall@4), so that researchers can quickly verify the effectiveness of schema linking without waiting for the entire SQL generation process to be completed. Specifically, the main contributions of the paper include: - **Propose for the first time the method of extracting the linking schema from the initial SQL (SQL - to - Schema)**, and define the evaluation metrics of the schema linking module, enabling researchers to quickly verify the effect of schema linking. - **Compared with the existing mainstream methods, when using the extracted linking schema to generate SQL, the execution accuracy is the highest**, especially when using CodeLlama - 34B. - **When combined with the complete Text - to - SQL method, the SQL generated by GPT - 4 performs better than all zero - shot and few - shot prompting methods on the Spider dataset**, which indicates that the improvement of schema linking can be transferred to the overall Text - to - SQL task. In conclusion, this paper significantly improves the performance of the Text - to - SQL task by proposing a new schema linking method and provides a new direction for future research.