CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL

Mayank Kothyari,Dhruva Dhingra,Sunita Sarawagi,Soumen Chakrabarti
2023-11-02
Abstract:Existing Text-to-SQL generators require the entire schema to be encoded with the user text. This is expensive or impractical for large databases with tens of thousands of columns. Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database, where the correct semantics of retrieval demands that we rank sets of schema elements rather than individual elements. In response, we propose a two-stage process for effective coverage during retrieval. First, we instruct an LLM to hallucinate a minimal DB schema deemed adequate to answer the query. We use the hallucinated schema to retrieve a subset of the actual schema, by composing the results from multiple dense retrievals. Remarkably, hallucination $\unicode{x2013}$ generally considered a nuisance $\unicode{x2013}$ turns out to be actually useful as a bridging mechanism. Since no existing benchmarks exist for schema subsetting on large databases, we introduce three benchmarks. Two semi-synthetic datasets are derived from the union of schemas in two well-known datasets, SPIDER and BIRD, resulting in 4502 and 798 schema elements respectively. A real-life benchmark called SocialDB is sourced from an actual large data warehouse comprising 17844 schema elements. We show that our method1 leads to significantly higher recall than SOTA retrieval-based augmentation methods.
Computation and Language
What problem does this paper attempt to address?
The paper primarily addresses the issues present in large-scale databases, specifically the challenges faced by existing Text-to-SQL generators when dealing with databases containing thousands of columns. Specifically, the paper aims to solve the following problems: 1. **Handling large-scale databases**: Existing Text-to-SQL systems require the entire database schema as input, which is very expensive or impractical for large-scale databases with numerous tables and columns. 2. **Improving retrieval efficiency and accuracy**: Standard dense retrieval techniques are insufficient to meet the needs of selecting appropriate schema subsets from large structured databases, as these techniques typically focus on ranking individual elements rather than collections of elements. The proposed method (CRUSH4SQL) aims to efficiently address the above issues through a two-stage process: - **First stage**: Utilize a language model (LLM) for schema hallucination, generating a minimized database schema that is sufficient to answer the given query. - **Second stage**: Use the hallucinated schema to retrieve a subset of the actual schema and determine a high-recall yet small schema subset through an optimized combinatorial objective function. The main contributions of the paper include: - Proposing the CRUSH method, which combines the schema hallucination capabilities of LLMs with dense retrieval techniques to identify a small, high-recall schema subset needed for downstream Text-to-SQL stages. - Creating three new benchmark datasets to evaluate the retrieval enhancement effects in the Text-to-SQL conversion task, especially for large-scale databases. These include two semi-synthetic datasets (SpiderUnion and BirdUnion) and one dataset from a real production data warehouse (SocialDB). Through experimental comparisons on these benchmark datasets, the paper demonstrates consistent improvements in the recall of golden schema elements with CRUSH compared to existing methods, which directly translates to improved Text-to-SQL generation accuracy. Additionally, the paper analyzes the weaknesses of existing single embedding or token-level representation methods.