SQLfuse: Enhancing Text-to-SQL Performance through Comprehensive LLM Synergy

Tingkai Zhang, Chaoyu Chen, Cong Liao, Jun Wang, Xudong Zhao, Hang Yu, Jianchao Wang, Jianguo Li, Wenhui Shi
2024-07-19
Abstract: Text-to-SQL conversion is a critical innovation, simplifying the transition from complex SQL to intuitive natural language queries, especially significant given SQL's prevalence in the job market across various roles. The rise of Large Language Models (LLMs) like GPT-3.5 and GPT-4 has greatly advanced this field, offering improved natural language understanding and the ability to generate nuanced SQL statements. However, the potential of open-source LLMs in Text-to-SQL applications remains underexplored, with many frameworks failing to leverage their full capabilities, particularly in handling complex database queries and incorporating feedback for iterative refinement. Addressing these limitations, this paper introduces SQLfuse, a robust system integrating open-source LLMs with a suite of tools to enhance Text-to-SQL translation's accuracy and usability. SQLfuse features four modules: schema mining, schema linking, SQL generation, and a SQL critic module, to not only generate but also continuously enhance SQL query quality. Demonstrated by its leading performance on the Spider Leaderboard and deployment by Ant Group, SQLfuse showcases the practical merits of open-source LLMs in diverse business contexts.
Computation and Language, Artificial Intelligence, Databases
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses several key issues in text-to-SQL (Text-to-SQL) conversion and aims to improve the accuracy and usability of the process by introducing a new system called SQLfuse. Specifically:

1. **Limitations of Existing Frameworks**: Current Text-to-SQL frameworks based on large language models (LLMs) fail to fully leverage the capabilities of open-source LLMs, particularly in handling complex database queries and integrating feedback for iterative improvement.
2. **Handling Complex Relationships**: Existing Text-to-SQL systems often overlook one-to-many relationships between tables and the correspondence between enumerated values and natural language, both of which matter especially when constructing aggregate queries.
3. **Utilizing Execution Error Feedback**: Existing systems typically do not use execution error feedback to correct inaccurate SQL, even though such feedback can provide valuable correction clues.
4. **Lack of Evaluation Module**: Existing systems lack a module that assesses the SQL candidates generated by LLMs and selects the best one, a step that can significantly improve result quality.

To address these issues, the paper proposes SQLfuse, which consists of four synergistic modules: schema mining, schema linking, SQL generation (SQLgen), and a SQL critic module. These modules not only generate SQL queries but also iteratively refine them to improve query quality. SQLfuse achieves an accuracy of 85.6% on the Spider Leaderboard and has been deployed in practical applications at Ant Group.
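Points 3 and 4 above (execution error feedback and candidate selection) can be illustrated with a minimal sketch. This is not SQLfuse's implementation. It assumes an SQLite database, and the helper names `try_execute`, `generate_with_feedback`, and `pick_best`, the stub generator, and the toy scoring function are all hypothetical stand-ins for the LLM-backed components the summary describes.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def try_execute(conn: sqlite3.Connection, sql: str) -> Tuple[bool, str]:
    """Run a query and return (succeeded, error message)."""
    try:
        conn.execute(sql).fetchall()
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)


def generate_with_feedback(
    conn: sqlite3.Connection,
    question: str,
    generate: Callable[[str, str], str],  # hypothetical: (question, last_error) -> SQL
    max_rounds: int = 3,
) -> Optional[str]:
    """Regenerate SQL, passing the previous execution error back to the generator."""
    error = ""
    for _ in range(max_rounds):
        sql = generate(question, error)
        ok, error = try_execute(conn, sql)
        if ok:
            return sql
    return None  # no executable query within the round budget


def pick_best(
    conn: sqlite3.Connection,
    candidates: List[str],
    score: Callable[[str], float],  # hypothetical stand-in for a critic model
) -> Optional[str]:
    """Drop candidates that fail to execute, then keep the highest-scoring one."""
    runnable = [c for c in candidates if try_execute(conn, c)[0]]
    return max(runnable, key=score) if runnable else None


# Toy database and a stub generator that misspells a column on its first attempt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
conn.execute("INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 41)")


def stub_generate(question: str, error: str) -> str:
    # A real system would put `error` into the LLM prompt; here it only
    # switches the stub from the broken query to the corrected one.
    if not error:
        return "SELECT nme FROM singer"
    return "SELECT name FROM singer"


fixed = generate_with_feedback(conn, "List all singer names", stub_generate)

candidates = [
    "SELECT nme FROM singer",                   # fails to execute, filtered out
    "SELECT name FROM singer WHERE age > 35",
    "SELECT name FROM singer",
]
# Toy score: prefer the shortest runnable query (an assumption, not a real critic).
best = pick_best(conn, candidates, score=lambda sql: -len(sql))
```

In this sketch the database's error message is the only feedback signal and the critic is a one-line scoring function. In the system summarized above, those roles are filled by the SQL generation and SQL critic modules.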