Abstract:In sophisticated existing Text-to-SQL methods exhibit errors in various proportions, including schema-linking errors (incorrect columns, tables, or extra columns), join errors, nested errors, and group-by errors. Consequently, there is a critical need to filter out unnecessary tables and columns, directing the language models attention to relevant tables and columns with schema-linking, to reduce errors during SQL generation. Previous approaches have involved sorting tables and columns based on their relevance to the question, selecting the top-ranked ones for sorting, or directly identifying the necessary tables and columns for SQL generation. However, these methods face challenges such as lengthy model training times, high consumption of expensive GPT-4 tokens in few-shot prompts, or suboptimal performance in schema linking. Therefore, we propose an inventive schema linking method in two steps: Firstly, generate an initial SQL query by utilizing the complete database schema. Subsequently, extract tables and columns from the initial SQL query to create a concise schema. Using CodeLlama-34B, when comparing the schemas obtained by mainstream methods with ours for SQL generation, our schema performs optimally. Leveraging GPT4, our SQL generation method achieved results that are comparable to mainstream Text-to-SQL methods on the Spider dataset.

What problem does this paper attempt to address?

This paper aims to solve various error problems existing in the existing Text - to - SQL methods, including schema linking errors (such as incorrect columns, tables or extra columns), join errors, nesting errors and grouping errors. These problems lead to errors in generating SQL queries, especially when dealing with complex queries. Therefore, the paper proposes an innovative schema linking method to improve the accuracy of SQL generation through a two - step method: 1. **Generate the initial SQL query**: Use the complete database schema to generate a preliminary SQL query. 2. **Extract the schema**: Extract tables and columns from the initial SQL query to form a concise schema. The main goal of this method is to reduce unnecessary tables and columns, guide the language model to focus on relevant tables and columns, and thus reduce errors in the SQL generation process. The paper also proposes new evaluation metrics (such as Table Recall@4), so that researchers can quickly verify the effectiveness of schema linking without waiting for the entire SQL generation process to be completed. Specifically, the main contributions of the paper include: - **Propose for the first time the method of extracting the linking schema from the initial SQL (SQL - to - Schema)**, and define the evaluation metrics of the schema linking module, enabling researchers to quickly verify the effect of schema linking. - **Compared with the existing mainstream methods, when using the extracted linking schema to generate SQL, the execution accuracy is the highest**, especially when using CodeLlama - 34B. - **When combined with the complete Text - to - SQL method, the SQL generated by GPT - 4 performs better than all zero - shot and few - shot prompting methods on the Spider dataset**, which indicates that the improvement of schema linking can be transferred to the overall Text - to - SQL task. In conclusion, this paper significantly improves the performance of the Text - to - SQL task by proposing a new schema linking method and provides a new direction for future research.

SQL-to-Schema Enhances Schema Linking in Text-to-SQL

RSL-SQL: Robust Schema Linking in Text-to-SQL Generation

The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models

Semantic Enhanced Text-to-SQL Parsing Via Iteratively Learning Schema Linking Graph

SA-SQL: A Schema-Aligned Framework for Text-to-SQL Through Large Language Models

E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL

RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL

SEA-SQL: Semantic-Enhanced Text-to-SQL with Adaptive Refinement

SQLfuse: Enhancing Text-to-SQL Performance through Comprehensive LLM Synergy

Schema-Aware Multi-Task Learning for Complex Text-to-SQL

CatSQL: Towards Real World Natural Language to SQL Applications.

Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning

PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency

MAG-SQL: Multi-Agent Generative Approach with Soft Schema Linking and Iterative Sub-SQL Refinement for Text-to-SQL

CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL

RH-SQL: Refined Schema and Hardness Prompt for Text-to-SQL

Structure Guided Large Language Model for SQL Generation

Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload

LR-SQL: A Supervised Fine-Tuning Method for Text2SQL Tasks under Low-Resource Scenarios

CodeS: Towards Building Open-source Language Models for Text-to-SQL

XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL