Abstract:This study presents a comparative analysis of the a complex SQL benchmark, TPC-DS, with two existing text-to-SQL benchmarks, BIRD and Spider. Our findings reveal that TPC-DS queries exhibit a significantly higher level of structural complexity compared to the other two benchmarks. This underscores the need for more intricate benchmarks to simulate realistic scenarios effectively. To facilitate this comparison, we devised several measures of structural complexity and applied them across all three benchmarks. The results of this study can guide future research in the development of more sophisticated text-to-SQL benchmarks. We utilized 11 distinct Language Models (LLMs) to generate SQL queries based on the query descriptions provided by the TPC-DS benchmark. The prompt engineering process incorporated both the query description as outlined in the TPC-DS specification and the database schema of TPC-DS. Our findings indicate that the current state-of-the-art generative AI models fall short in generating accurate decision-making queries. We conducted a comparison of the generated queries with the TPC-DS gold standard queries using a series of fuzzy structure matching techniques based on query features. The results demonstrated that the accuracy of the generated queries is insufficient for practical real-world application.

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is to evaluate and compare the performance of large - language models (LLMs) in the task of generating complex SQL queries, especially their ability when dealing with highly complex SQL queries. Specifically, the paper explores this problem in the following aspects: 1. **Benchmark Comparison**: - The paper introduces TPC - DS (Transaction Processing Performance Council Decision Support) as a new complex SQL query benchmark and compares it with the two existing main text - to - SQL generation benchmarks, BIRD and Spider. - Through comparison, it is found that the structural complexity of TPC - DS queries is significantly higher than that of BIRD and Spider. This indicates that the existing benchmarks may not fully simulate complex scenarios in the real world. 2. **Complexity Measurement**: - In order to quantify the complexity differences between different benchmarks, the paper defines a series of structural complexity indicators, including the number of sub - queries, JOIN operations, aggregate functions, common table expressions (CTE), the number of predicates in the WHERE clause, etc. - The results show that TPC - DS queries exhibit higher complexity in all these indicators, further proving its importance as a complex SQL generation benchmark. 3. **LLM Performance Evaluation**: - The paper uses 11 different LLMs (such as gpt - 4, gemini - 1.5, etc.) to generate SQL queries based on the TPC - DS workload and evaluates the performance of these models. - The results show that although some large LLMs can generate queries with a certain degree of complexity, they are still insufficient in terms of accuracy and practicality, especially when dealing with complex decision - making queries. 4. **Future Work Directions**: - The paper proposes several improvement directions, including incremental SQL generation, fine - tuning small models, and introducing a human - machine collaborative workflow, in order to improve the accuracy of complex SQL generation tasks. In summary, this paper aims to reveal the limitations of existing LLMs in handling complex SQL query tasks and emphasizes the need for more complex benchmarks and improved methods to enhance the performance of models.

Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload

Analyzing the Effectiveness of Large Language Models on Text-to-SQL Synthesis

Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement

Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL

Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs

Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Using LLM to select the right SQL Query from candidates

A Survey of Large Language Model-Based Generative AI for Text-to-SQL: Benchmarks, Applications, Use Cases, and Challenges

SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL

From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems

SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended)

BEAVER: An Enterprise Benchmark for Text-to-SQL

Enhancing Text-to-SQL Translation for Financial System Design

MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation

PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL

SEA-SQL: Semantic-Enhanced Text-to-SQL with Adaptive Refinement

Exploring the Use of LLMs for SQL Equivalence Checking

Large Language Model Enhanced Text-to-SQL Generation: A Survey

SQLfuse: Enhancing Text-to-SQL Performance through Comprehensive LLM Synergy