Abstract: NL2SQL (Natural Language to Structured Query Language) transformation has seen wide adoption in Business Intelligence (BI) applications in recent years. However, existing NL2SQL benchmarks are not suitable for production BI scenarios, as they are not designed for common business intelligence questions. To address this gap, we have developed a new benchmark focused on typical NL questions in industrial BI scenarios. We discuss the challenges of constructing a BI-focused benchmark and the shortcomings of existing benchmarks. Additionally, we introduce question categories in our benchmark that reflect common BI inquiries. Lastly, we propose two novel semantic similarity evaluation metrics for assessing NL2SQL capabilities in BI applications and services.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the shortcomings of existing natural-language-to-SQL (NL2SQL) benchmarks in business intelligence (BI) scenarios. Specifically, it points out the following:

1. **Existing NL2SQL benchmarks are not suitable for production BI scenarios**:
   - Most existing benchmarks are designed for general natural-language queries rather than for common business intelligence questions. For example, the WikiSQL benchmark mainly contains factual questions about Wikipedia, whereas BI scenarios involve more time-series data and complex multi-table queries.
   - Existing evaluation metrics (such as the exact match rate) are too strict and do not credit the useful information in partial matches.
2. **Challenges of database schemas and contents**:
   - BI databases may contain irregularities in their schema definitions, such as the same data appearing under different column names in different tables, or the same column name representing different data in different tables.
   - Existing benchmarks do not consider the impact of these irregularities on NL2SQL performance.
3. **Time-related challenges**:
   - Most natural-language questions in BI contain time constraints, such as "last Friday" or "the past 5 days", but existing benchmarks do not cover these time ranges.
   - Inferring time constraints is very challenging for NL2SQL models.
4. **Challenges of question contexts**:
   - User questions in existing benchmarks do not cover complex time-series selections, nor are they classified from a business-analysis perspective.
   - Questions in BI scenarios make heavy use of technical terms, so understanding user intent requires domain knowledge.
5. **Challenges of question languages**:
   - A single question may combine several languages, such as English abbreviations mixed with Chinese or other non-Latin-script languages.
   - Non-English users may use English and their native language in the same query, which makes direct keyword matching difficult.
6. **Challenges of evaluation metrics**:
   - Existing benchmarks usually adopt an exact-match strategy, which may overlook predictions that are partially correct or semantically identical.
   - Comparing two SQL statements is hard because statements with almost entirely different syntax may have the same or similar meaning, and different queries may produce the same or similar results.

### Solutions

To address these problems, the paper proposes a new benchmark, BIS (Business Intelligence Scenarios). Its main contributions are as follows:

1. **It describes the shortcomings of existing benchmarks and evaluation metrics in BI scenarios**.
2. **It proposes a new benchmark that includes two new evaluation metrics (semantic query similarity and result partial similarity)** to evaluate model performance more realistically.

With these improvements, the BIS benchmark is intended to better support and evaluate the use of NL2SQL models in business intelligence scenarios.
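The evaluation-metric challenge described above can be made concrete with a small sketch. It shows a query that fails exact match while returning the same rows as the reference, and a query that is only partially right. The `sales` table, the three queries, and the Jaccard overlap of result rows are all assumptions chosen for illustration; the overlap score is a simple stand-in and is not the paper's semantic query similarity or result partial similarity metric, whose definitions are not given in this summary.

```python
import sqlite3

def run(conn, sql):
    """Execute a query and return its rows as a list of tuples."""
    return conn.execute(sql).fetchall()

def exact_match(sql_a, sql_b):
    """String-level match after lowercasing and collapsing whitespace."""
    def norm(s):
        return " ".join(s.lower().split())
    return norm(sql_a) == norm(sql_b)

def row_jaccard(rows_a, rows_b):
    """Jaccard overlap of two result sets (illustrative stand-in only)."""
    a, b = set(rows_a), set(rows_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical toy data, not taken from the benchmark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, day TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', '2024-01-01', 10),
        ('north', '2024-01-02', 20),
        ('south', '2024-01-01', 5),
        ('south', '2024-01-02', 7);
""")

gold = "SELECT region, SUM(amount) FROM sales GROUP BY region"

# Different syntax, same meaning: aggregation via a correlated subquery.
pred_same = (
    "SELECT s.region, "
    "(SELECT SUM(amount) FROM sales t WHERE t.region = s.region) "
    "FROM (SELECT DISTINCT region FROM sales) s"
)

# Partially right: correct aggregation, but one region is filtered out.
pred_partial = (
    "SELECT region, SUM(amount) FROM sales "
    "WHERE region = 'north' GROUP BY region"
)

print(exact_match(gold, pred_same))                              # False
print(row_jaccard(run(conn, gold), run(conn, pred_same)))        # 1.0
print(row_jaccard(run(conn, gold), run(conn, pred_partial)))     # 0.5
```

Under exact match both predictions score zero, even though the first is equivalent to the reference on this data and the second recovers half of the expected rows. Note that agreeing on one database instance does not prove two queries are equivalent in general.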
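The time-related challenge above comes down to turning a relative phrase into concrete dates before any SQL can be written. The following minimal sketch resolves the two phrases quoted in the summary against a fixed reference date. The function names, the `day` column, and the interpretation of "the past 5 days" as the five days ending today (inclusive) are assumptions for illustration; the summary does not say how the benchmark resolves such phrases.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbering: Monday = 0 ... Sunday = 6

def last_weekday(today, weekday):
    """Most recent occurrence of `weekday` strictly before `today`."""
    delta = (today.weekday() - weekday) % 7
    if delta == 0:
        delta = 7  # "last Friday" asked on a Friday means a week ago
    return today - timedelta(days=delta)

def past_n_days(today, n):
    """Inclusive (start, end) range of the n days ending today (assumed reading)."""
    return today - timedelta(days=n - 1), today

def between_clause(column, start, end):
    """Parameterized WHERE fragment plus its ISO-formatted parameters."""
    return f"{column} BETWEEN ? AND ?", (start.isoformat(), end.isoformat())

today = date(2024, 5, 15)  # a Wednesday, used as the reference date

friday = last_weekday(today, FRIDAY)
print(friday)  # 2024-05-10

start, end = past_n_days(today, 5)
print(between_clause("day", start, end))
# ('day BETWEEN ? AND ?', ('2024-05-11', '2024-05-15'))
```

Even this small case has two defensible readings (whether today counts, and what "last Friday" means on a Friday), which hints at why time inference is hard for NL2SQL models.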

BIS: NL2SQL Service Evaluation Benchmark for Business Intelligence Scenarios
