CodeS: Towards Building Open-source Language Models for Text-to-SQL
Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen
2024-02-26
Abstract: Language models have shown promising performance on the task of translating
natural language questions into SQL queries (Text-to-SQL). However, most of the
state-of-the-art (SOTA) approaches rely on powerful yet closed-source large
language models (LLMs), such as ChatGPT and GPT-4, which may come with
opaque model architectures, data privacy risks, and expensive inference
overhead. To address these limitations, we introduce CodeS, a series
of pre-trained language models ranging from 1B to 15B parameters,
specifically designed for the text-to-SQL task. CodeS is fully open-source
and achieves superior accuracy with far fewer parameters.
This paper studies the research challenges in building CodeS. To enhance
the SQL generation abilities of CodeS, we adopt an incremental pre-training
approach using a specially curated SQL-centric corpus. Building on this, we
address the challenges of schema linking and rapid domain adaptation through
strategic prompt construction and a bi-directional data augmentation technique.
We conduct comprehensive evaluations on multiple datasets, including the widely
used Spider benchmark, the newly released BIRD benchmark, robustness-diagnostic
benchmarks such as Spider-DK, Spider-Syn, Spider-Realistic, and Dr.Spider, as
well as two real-world datasets created for financial and academic
applications. The experimental results show that CodeS achieves new SOTA
accuracy and robustness on nearly all challenging text-to-SQL benchmarks.
Databases, Computation and Language