BIS: NL2SQL Service Evaluation Benchmark for Business Intelligence Scenarios

Bora Caglayan, Mingxue Wang, John D. Kelleher, Shen Fei, Gui Tong, Jiandong Ding, Puchao Zhang
2024-10-30
Abstract: NL2SQL (Natural Language to Structured Query Language) transformation has seen wide adoption in Business Intelligence (BI) applications in recent years. However, existing NL2SQL benchmarks are not suitable for production BI scenarios, as they are not designed for common business intelligence questions. To address this gap, we have developed a new benchmark focused on typical NL questions in industrial BI scenarios. We discuss the challenges of constructing a BI-focused benchmark and the shortcomings of existing benchmarks. Additionally, we introduce question categories in our benchmark that reflect common BI inquiries. Lastly, we propose two novel semantic similarity evaluation metrics for assessing NL2SQL capabilities in BI applications and services.
Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the deficiencies of existing natural-language-to-SQL (NL2SQL) benchmarks in business intelligence (BI) scenarios. Specifically, it points out the following:

1. **Existing NL2SQL benchmarks are not suitable for production BI scenarios**:
   - Most existing benchmarks are designed for general natural-language queries rather than for common BI questions. For example, the WikiSQL benchmark mainly contains factual questions about Wikipedia, whereas BI scenarios involve more time-series data and complex multi-table queries.
   - Existing evaluation metrics, such as the exact match rate, are too strict and give no credit for the useful information in a partial match.
2. **Challenges of database schemas and contents**:
   - BI databases may contain irregularities in their schema definitions, such as the same data appearing under different column names in different tables, or the same column name representing different data in different tables.
   - Existing benchmarks do not consider the impact of these irregularities on NL2SQL performance.
3. **Time-related challenges**:
   - Most natural-language questions in BI contain time constraints, such as "last Friday" or "the past 5 days", but existing benchmarks do not cover such time ranges.
   - Inferring time constraints is very challenging for NL2SQL models.
4. **Challenges of question contexts**:
   - User questions in existing benchmarks do not cover complex time-series selections, nor are they classified from a business-analysis perspective.
   - Questions in BI scenarios make heavy use of technical terms, and understanding the user's intent requires domain knowledge.
5. **Challenges of question languages**:
   - A single question may combine several languages, such as English abbreviations mixed with Chinese or another language that does not use the Latin script.
   - Non-English users may use English and their native language in the same query, which makes direct keyword matching difficult.
6. **Challenges of evaluation metrics**:
   - Performance metrics in existing benchmarks usually adopt an exact-match strategy, which may overlook predictions that are partially correct or semantically equivalent.
   - Comparing two SQL statements is difficult because statements that differ almost completely in syntax may have the same or similar meanings, and different queries may produce the same or similar results.

### Solutions

To address these problems, the paper proposes a new benchmark, BIS (Business Intelligence Scenario). Its main contributions are as follows:

1. **Describe the deficiencies of existing benchmarks and evaluation metrics in BI scenarios**.
2. **Propose a new benchmark that includes two new evaluation metrics (semantic query similarity and result partial similarity)** to evaluate model performance more realistically.

With these improvements, the BIS benchmark can better support and evaluate the application of NL2SQL models in business intelligence scenarios.
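
The evaluation-metric challenge described above (syntactically different SQL statements can return the same rows, and a strict exact match gives no credit for partial results) can be illustrated with a small sketch. This is not the paper's semantic query similarity or result partial similarity metric, whose definitions are not reproduced in this summary. The `sales` table, its rows, and the helper names `exact_match`, `result_overlap`, and `demo` are all hypothetical, and the Jaccard overlap of result rows is only a stand-in for a result-based partial score.

```python
import sqlite3


def exact_match(sql_a: str, sql_b: str) -> bool:
    """Strict string comparison after lowercasing and collapsing whitespace."""
    def normalize(sql: str) -> str:
        return " ".join(sql.lower().split())
    return normalize(sql_a) == normalize(sql_b)


def result_overlap(conn: sqlite3.Connection, sql_a: str, sql_b: str) -> float:
    """Jaccard overlap of the two result sets (illustrative stand-in metric).

    Rows are compared as tuples; duplicate rows and row order are ignored.
    """
    rows_a = set(conn.execute(sql_a).fetchall())
    rows_b = set(conn.execute(sql_b).fetchall())
    if not rows_a and not rows_b:
        return 1.0
    return len(rows_a & rows_b) / len(rows_a | rows_b)


def demo() -> dict:
    """Score two predicted queries against a gold query on a toy table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical BI-style table, invented for this example.
    conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("2024-10-25", "north", 120),
            ("2024-10-25", "south", 80),
            ("2024-10-26", "north", 200),
            ("2024-10-27", "south", 50),
        ],
    )
    gold = ("SELECT region, amount FROM sales "
            "WHERE day >= '2024-10-25' AND day <= '2024-10-26'")
    # Same meaning as the gold query, different syntax.
    equivalent = ("SELECT region, amount FROM sales "
                  "WHERE day BETWEEN '2024-10-25' AND '2024-10-26'")
    # Covers only part of the requested date range.
    partial = "SELECT region, amount FROM sales WHERE day = '2024-10-25'"
    scores = {
        "equivalent": (exact_match(gold, equivalent),
                       result_overlap(conn, gold, equivalent)),
        "partial": (exact_match(gold, partial),
                    result_overlap(conn, gold, partial)),
    }
    conn.close()
    return scores


if __name__ == "__main__":
    for name, (em, overlap) in demo().items():
        print(f"{name}: exact_match={em}, result_overlap={overlap:.2f}")
```

In this toy setup both predictions fail the exact match, yet the result-based score separates them: the `BETWEEN` query returns the gold rows exactly, while the single-day query recovers two of the three. A set-based overlap is a deliberately simple choice; a real metric would also need to handle duplicate rows, ordering, and column alignment.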