DataGpt-SQL-7B: An Open-Source Language Model for Text-to-SQL

Lixia Wu, Peng Li, Junhong Lou, Lei Fu
2024-09-24
Abstract: In addressing the pivotal role of translating natural language queries into SQL commands, we propose a suite of compact, fine-tuned models and self-refine mechanisms to democratize data access and analysis for non-expert users, mitigating risks associated with closed-source Large Language Models. Specifically, we constructed a dataset of over 20K samples for Text-to-SQL, as well as a preference dataset, to improve efficiency in the domain of SQL generation. To further ensure code validity, a code corrector was integrated into the model. Our system, DataGpt-sql, achieved 87.2% accuracy on Spider-dev, showcasing the effectiveness of our solution in text-to-SQL conversion tasks. Our code, data, and models are available at https://github.com/CainiaoTechAi/datagpt-sql-7b
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of converting natural language queries into SQL commands. The research team proposes DataGpt-sql, a series of compact, fine-tuned models, together with a self-refine mechanism, so that non-expert users can access and analyze data easily. The paper focuses on the following aspects:

1. **Improving SQL generation efficiency**: Constructing a dataset of more than 20,000 samples and applying Cross-DB and Inner-DB enhancement methods to improve the model's ability to recognize the correct patterns and columns.
2. **Ensuring code validity**: Introducing a code corrector so that the generated SQL conforms to the expected syntax and contains fewer errors.
3. **Enhancing execution accuracy**: Further fine-tuning the model with Direct Preference Optimization (DPO) so that the generated SQL is more often correct when actually executed.

Experimental results show that DataGpt-sql achieved an execution accuracy (EX) of 87.2% and a test suite accuracy (TS) of 83.5% on the Spider-dev benchmark, significantly outperforming other existing models. The paper also proposes an efficient reflective agent mechanism based on execution result feedback, which further improves the system's reliability and accuracy.

In summary, this research aims to reduce reliance on closed-source large language models by developing fine-tuned models specifically for text-to-SQL tasks, while also making database interaction more convenient and secure for non-expert users.
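To make the ideas of execution accuracy and execution-result feedback concrete, here is a minimal Python sketch using the standard `sqlite3` module. It is not the authors' implementation. The function names (`run_query`, `execution_match`, `refine_with_feedback`), the `generate` callable standing in for a model, and the feedback string format are all assumptions made for illustration. The row comparison is a simplified stand-in for an EX-style check and does not reproduce the official Spider evaluation.

```python
import sqlite3
from collections import Counter
from typing import Callable, List, Optional, Tuple


def run_query(conn: sqlite3.Connection, sql: str) -> Tuple[Optional[List[tuple]], Optional[str]]:
    """Execute `sql`; return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Simplified EX-style check: both queries must run and return the same multiset of rows."""
    pred_rows, pred_err = run_query(conn, predicted_sql)
    gold_rows, gold_err = run_query(conn, gold_sql)
    if pred_err is not None or gold_err is not None:
        return False
    # Compare as multisets so that row order does not matter.
    return Counter(pred_rows) == Counter(gold_rows)


def refine_with_feedback(
    conn: sqlite3.Connection,
    question: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical model wrapper
    max_rounds: int = 3,
) -> str:
    """Ask `generate` for SQL, execute it, and pass any database error back as feedback.

    Returns the first SQL string that executes without error, or the last
    attempt if every round fails.
    """
    feedback: Optional[str] = None
    sql = ""
    for _ in range(max_rounds):
        sql = generate(question, feedback)
        _, error = run_query(conn, sql)
        if error is None:
            return sql
        # Assumed feedback format; a real system would build a model prompt here.
        feedback = f"Previous SQL: {sql}\nError: {error}"
    return sql
```

In this sketch the database error message is the only feedback signal. A reflective agent could also inspect empty or implausible result sets, but that would need assumptions the summary does not spell out.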