Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload

Limin Ma,Ken Pu,Ying Zhu
2024-07-28
Abstract:This study presents a comparative analysis of the a complex SQL benchmark, TPC-DS, with two existing text-to-SQL benchmarks, BIRD and Spider. Our findings reveal that TPC-DS queries exhibit a significantly higher level of structural complexity compared to the other two benchmarks. This underscores the need for more intricate benchmarks to simulate realistic scenarios effectively. To facilitate this comparison, we devised several measures of structural complexity and applied them across all three benchmarks. The results of this study can guide future research in the development of more sophisticated text-to-SQL benchmarks. We utilized 11 distinct Language Models (LLMs) to generate SQL queries based on the query descriptions provided by the TPC-DS benchmark. The prompt engineering process incorporated both the query description as outlined in the TPC-DS specification and the database schema of TPC-DS. Our findings indicate that the current state-of-the-art generative AI models fall short in generating accurate decision-making queries. We conducted a comparison of the generated queries with the TPC-DS gold standard queries using a series of fuzzy structure matching techniques based on query features. The results demonstrated that the accuracy of the generated queries is insufficient for practical real-world application.
Databases,Artificial Intelligence
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is to evaluate and compare the performance of large - language models (LLMs) in the task of generating complex SQL queries, especially their ability when dealing with highly complex SQL queries. Specifically, the paper explores this problem in the following aspects: 1. **Benchmark Comparison**: - The paper introduces TPC - DS (Transaction Processing Performance Council Decision Support) as a new complex SQL query benchmark and compares it with the two existing main text - to - SQL generation benchmarks, BIRD and Spider. - Through comparison, it is found that the structural complexity of TPC - DS queries is significantly higher than that of BIRD and Spider. This indicates that the existing benchmarks may not fully simulate complex scenarios in the real world. 2. **Complexity Measurement**: - In order to quantify the complexity differences between different benchmarks, the paper defines a series of structural complexity indicators, including the number of sub - queries, JOIN operations, aggregate functions, common table expressions (CTE), the number of predicates in the WHERE clause, etc. - The results show that TPC - DS queries exhibit higher complexity in all these indicators, further proving its importance as a complex SQL generation benchmark. 3. **LLM Performance Evaluation**: - The paper uses 11 different LLMs (such as gpt - 4, gemini - 1.5, etc.) to generate SQL queries based on the TPC - DS workload and evaluates the performance of these models. - The results show that although some large LLMs can generate queries with a certain degree of complexity, they are still insufficient in terms of accuracy and practicality, especially when dealing with complex decision - making queries. 4. **Future Work Directions**: - The paper proposes several improvement directions, including incremental SQL generation, fine - tuning small models, and introducing a human - machine collaborative workflow, in order to improve the accuracy of complex SQL generation tasks. In summary, this paper aims to reveal the limitations of existing LLMs in handling complex SQL query tasks and emphasizes the need for more complex benchmarks and improved methods to enhance the performance of models.