SQLPrompt: In-Context Text-to-SQL with Minimal Labeled Data

Ruoxi Sun, Sercan Ö. Arik, Rajarishi Sinha, Hootan Nakhost, Hanjun Dai, Pengcheng Yin, Tomas Pfister
2023-11-06
Abstract: Text-to-SQL aims to automate the process of generating SQL queries on a database from natural language text. In this work, we propose "SQLPrompt", tailored to improve the few-shot prompting capabilities of Text-to-SQL for Large Language Models (LLMs). Our methods include an innovative prompt design, an execution-based consistency decoding strategy that selects the SQL whose execution outcome is most consistent with those of the other SQL proposals, and a method that aims to improve performance by diversifying the SQL proposals during consistency selection with different prompt designs ("MixPrompt") and foundation models ("MixLLMs"). We show that SQLPrompt outperforms previous approaches for in-context learning with few labeled data by a large margin, closing the gap with the fine-tuned state of the art, which uses thousands of labeled examples.
Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to address is the automated generation of SQL queries from natural language (Text-to-SQL) using a small amount of labeled data. Specifically, the paper proposes a method called "SQLPrompt," which aims to enhance the Text-to-SQL capabilities of large language models (LLMs) under few-shot prompting. The paper focuses on the following aspects:

1. **Innovative Prompt Design**: Guiding the model to generate diverse SQL queries through different prompt designs.
2. **Execution-based Consistency Decoding Strategy**: Improving accuracy by executing the candidate SQL queries and selecting the one whose execution result agrees with the most other candidates.
3. **Methods for Diverse SQL Proposals**: Enhancing the diversity of SQL proposals by combining different prompt designs ("MixPrompt") and different base models ("MixLLMs").

The purpose of these methods is to improve Text-to-SQL performance when only a small amount of labeled data is available, thereby reducing the reliance on large labeled datasets, lowering the cost of adapting to new data, and reducing the risks of overfitting and poor generalization.
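
The execution-based consistency selection described above can be illustrated with a minimal sketch. It assumes a simple majority vote over execution results on a SQLite database; the paper's exact selection rule may differ. The function names (`execute`, `select_by_execution_consistency`), the `singer` table, and the candidate queries are hypothetical, standing in for SQL proposals pooled from several prompt designs and models.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run one candidate query; return a hashable result, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # invalid SQL gets no vote
    # Sort rows so that row order alone does not split the vote.
    return tuple(sorted(rows, key=repr))


def select_by_execution_consistency(conn, candidates):
    """Return the first candidate whose execution result is the most common one."""
    outcomes = [execute(conn, sql) for sql in candidates]
    votes = Counter(o for o in outcomes if o is not None)
    if not votes:
        return None  # every candidate failed to execute
    winner, _ = votes.most_common(1)[0]
    for sql, outcome in zip(candidates, outcomes):
        if outcome == winner:
            return sql


# Hypothetical toy database and candidate pool (assumed, not from the paper).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ana", 31), ("Bo", 45), ("Cy", 27)],
)

candidates = [
    "SELECT COUNT(*) FROM singer WHERE age > 30",     # e.g. prompt design A
    "SELECT COUNT(name) FROM singer WHERE age > 30",  # e.g. prompt design B
    "SELECT COUNT(*) FROM singer WHERE age >= 27",    # disagrees with the majority
    "SELECT COUNT(*) FROM singers WHERE age > 30",    # fails: no such table
]

best = select_by_execution_consistency(conn, candidates)
print(best)
```

Voting on execution results rather than on SQL text lets syntactically different but equivalent queries reinforce each other, which is why a more diverse candidate pool can help the selection step.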