The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models

Karime Maamari,Fadhil Abubaker,Daniel Jaroslawicz,Amine Mhedhbi
2024-08-19
Abstract:Schema linking is a crucial step in Text-to-SQL pipelines. Its goal is to retrieve the relevant tables and columns of a target database for a user's query while disregarding irrelevant ones. However, imperfect schema linking can often exclude required columns needed for accurate query generation. In this work, we revisit schema linking when using the latest generation of large language models (LLMs). We find empirically that newer models are adept at utilizing relevant schema elements during generation even in the presence of large numbers of irrelevant ones. As such, our Text-to-SQL pipeline entirely forgoes schema linking in cases where the schema fits within the model's context window in order to minimize issues due to filtering required schema elements. Furthermore, instead of filtering contextual information, we highlight techniques such as augmentation, selection, and correction, and adopt them to improve the accuracy of our Text-to-SQL pipeline. Our approach ranks first on the BIRD benchmark achieving an accuracy of 71.83%.
Computation and Language
What problem does this paper attempt to address?
### What problems does this paper attempt to solve? This paper aims to explore the necessity and effectiveness of schema linking in the Text - to - SQL conversion task in the era of modern large - language models (LLMs). Specifically, the paper mainly focuses on the following aspects: 1. **The role of schema linking**: - Schema linking is a crucial step in the Text - to - SQL pipeline. Its goal is to retrieve tables and columns relevant to the user query from the target database while ignoring irrelevant elements. - However, imperfect schema linking may exclude columns required for generating accurate SQL queries, thus leading to errors. 2. **The impact of LLMs on schema linking**: - The paper experimentally proves that as the reasoning ability of LLMs improves, these models can still correctly utilize relevant schema elements in the presence of a large number of irrelevant elements. - Therefore, when the schema can be fully placed in the model's context window, the schema linking step can be completely omitted to reduce problems caused by filtering required schema elements. 3. **Proposing alternatives**: - In order to improve the accuracy of the Text - to - SQL pipeline, the paper proposes using techniques such as augmentation, selection, and correction instead of relying on schema linking to filter context information. 4. **Experimental verification**: - The paper verifies the impact of different schema linking methods, augmentation techniques, selection techniques, and correction techniques on SQL generation accuracy through a series of experiments. - The experimental results show that for the latest LLMs, the benefits of schema linking are reduced and may even lead to performance degradation; while for earlier or smaller models, schema linking can still bring significant accuracy improvements. 5. **Conclusion**: - The paper concludes that in the most advanced LLMs, if the schema can be fully placed within the context window, schema linking is not necessary. But for earlier or smaller models, the accuracy gain of schema linking is still worth considering. - In actual data warehouse scenarios, when the entire schema exceeds the context window, the LLM's context window should be maximally utilized, and the most relevant columns should be selected to retain necessary schema elements. In summary, through experiments and analysis, this paper challenges the traditional view that schema linking is essential in the Text - to - SQL task and provides new ideas and technical directions for future research and applications.