Abstract:Existing Text-to-SQL generators require the entire schema to be encoded with the user text. This is expensive or impractical for large databases with tens of thousands of columns. Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database, where the correct semantics of retrieval demands that we rank sets of schema elements rather than individual elements. In response, we propose a two-stage process for effective coverage during retrieval. First, we instruct an LLM to hallucinate a minimal DB schema deemed adequate to answer the query. We use the hallucinated schema to retrieve a subset of the actual schema, by composing the results from multiple dense retrievals. Remarkably, hallucination $\unicode{x2013}$ generally considered a nuisance $\unicode{x2013}$ turns out to be actually useful as a bridging mechanism. Since no existing benchmarks exist for schema subsetting on large databases, we introduce three benchmarks. Two semi-synthetic datasets are derived from the union of schemas in two well-known datasets, SPIDER and BIRD, resulting in 4502 and 798 schema elements respectively. A real-life benchmark called SocialDB is sourced from an actual large data warehouse comprising 17844 schema elements. We show that our method1 leads to significantly higher recall than SOTA retrieval-based augmentation methods.

What problem does this paper attempt to address?

The paper primarily addresses the issues present in large-scale databases, specifically the challenges faced by existing Text-to-SQL generators when dealing with databases containing thousands of columns. Specifically, the paper aims to solve the following problems: 1. **Handling large-scale databases**: Existing Text-to-SQL systems require the entire database schema as input, which is very expensive or impractical for large-scale databases with numerous tables and columns. 2. **Improving retrieval efficiency and accuracy**: Standard dense retrieval techniques are insufficient to meet the needs of selecting appropriate schema subsets from large structured databases, as these techniques typically focus on ranking individual elements rather than collections of elements. The proposed method (CRUSH4SQL) aims to efficiently address the above issues through a two-stage process: - **First stage**: Utilize a language model (LLM) for schema hallucination, generating a minimized database schema that is sufficient to answer the given query. - **Second stage**: Use the hallucinated schema to retrieve a subset of the actual schema and determine a high-recall yet small schema subset through an optimized combinatorial objective function. The main contributions of the paper include: - Proposing the CRUSH method, which combines the schema hallucination capabilities of LLMs with dense retrieval techniques to identify a small, high-recall schema subset needed for downstream Text-to-SQL stages. - Creating three new benchmark datasets to evaluate the retrieval enhancement effects in the Text-to-SQL conversion task, especially for large-scale databases. These include two semi-synthetic datasets (SpiderUnion and BirdUnion) and one dataset from a real production data warehouse (SocialDB). Through experimental comparisons on these benchmark datasets, the paper demonstrates consistent improvements in the recall of golden schema elements with CRUSH compared to existing methods, which directly translates to improved Text-to-SQL generation accuracy. Additionally, the paper analyzes the weaknesses of existing single embedding or token-level representation methods.

CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL

RSL-SQL: Robust Schema Linking in Text-to-SQL Generation

Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning

Schema-Aware Multi-Task Learning for Complex Text-to-SQL

MAG-SQL: Multi-Agent Generative Approach with Soft Schema Linking and Iterative Sub-SQL Refinement for Text-to-SQL

CHESS: Contextual Harnessing for Efficient SQL Synthesis

SQL-to-Schema Enhances Schema Linking in Text-to-SQL

CatSQL: Towards Real World Natural Language to SQL Applications.

Towards Text-to-SQL over Aggregate Tables

RYANSQL: Recursively Applying Sketch-based Slot Fillings for Complex Text-to-SQL in Cross-Domain Databases

Cooperative SQL Generation for Segmented Databases By Using Multi-functional LLM Agents

HFD: Hierarchical feature decoupling for SQL generation from text

RH-SQL: Refined Schema and Hardness Prompt for Text-to-SQL

E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL

Text2SQL is Not Enough: Unifying AI and Databases with TAG

RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL

SA-SQL: A Schema-Aligned Framework for Text-to-SQL Through Large Language Models

SQLfuse: Enhancing Text-to-SQL Performance through Comprehensive LLM Synergy

UNITE: A Unified Benchmark for Text-to-SQL Evaluation

CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

Pushing the Boundaries of Crowd-enabled Databases with Query-driven Schema Expansion