Abstract: The development of Large Language Models (LLMs) has revolutionized QA across various industries, including the database domain. However, there is still no comprehensive benchmark for evaluating the capabilities of different LLMs and their modular components in database QA. To this end, we introduce DQABench, the first comprehensive database QA benchmark for LLMs. DQABench features an innovative LLM-based method to automate the generation, cleaning, and rewriting of the evaluation dataset, resulting in over 200,000 QA pairs in English and Chinese, separately. These QA pairs cover a wide range of database-related knowledge extracted from manuals, online communities, and database instances. This coverage allows for an additional assessment of LLMs' Retrieval-Augmented Generation (RAG) and Tool Invocation Generation (TIG) capabilities in the database QA task. Furthermore, we propose a comprehensive LLM-based database QA testbed, DQATestbed. This testbed is highly modular and scalable, with basic and advanced components such as Question Classification Routing (QCR), RAG, TIG, and Prompt Template Engineering (PTE). Moreover, DQABench provides a comprehensive evaluation pipeline that computes various metrics throughout a standardized evaluation process to ensure the accuracy and fairness of the evaluation. We use DQABench to comprehensively evaluate database QA capabilities under the proposed testbed. The evaluation reveals findings such as (i) the strengths and limitations of nine LLM-based QA bots and (ii) the performance impact and potential improvements of various service components (e.g., QCR, RAG, TIG). Our benchmark and findings will guide the future development of LLM-based database QA research.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: currently, there is a lack of comprehensive evaluation of the capabilities of large language models (LLMs) in database question-answering (DBQA) tasks. Specifically, the paper points out the following challenges:

1. **DBQA Dataset Issues (C1)**:
   - **Low Question Quality**: Online questions are too short and lack necessary background information.
   - **Low Answer Quality**: Many online answers have factual errors, are too brief, or are overly subjective.
   - **Limited Diversity**: Due to convergence bias and other factors, questions in online communities are concentrated on a few popular topics and database management system (DBMS) products.
2. **DBQA Testing Platform Issues (C2)**:
   - Previous evaluations have mainly focused on independent LLMs, ignoring indispensable components in DBQA, such as pre-training, fine-tuning, question classification routing (QCR), retrieval-augmented generation (RAG), and tool-invocation generation (TIG). Therefore, a testing platform that can integrate these components is needed to evaluate the functions of different LLMs.
3. **DBQA Evaluation Issues (C3)**:
   - Existing benchmarks have not fully compared the capabilities of LLMs in key aspects of the database field, such as the factual accuracy and technical depth of answers, but have focused more on the fluency of explanations.
   - There is a lack of reasonable metrics to measure the fine-grained performance and end-to-end performance of intermediate components (such as QCR, RAG, TIG).

To solve these problems, the authors propose DQABench (Database Question-Answer benchmark), which is a comprehensive database question-answering benchmark designed to evaluate the performance of LLMs in database question-answering tasks. By constructing a high-quality dataset, designing a modular testing platform, and implementing a standardized evaluation process, DQABench can better understand the advantages and limitations of LLMs in DBQA and guide the future development of LLM-based database applications.

### Specific Contributions

1. **Constructed the First Database Question-Answer Benchmark Dataset DQABench**:
   - It covers a wide range of database-related knowledge and complex questions, including general database questions, product-specific questions, and instance-specific questions.
   - The dataset contains more than 200,000 Chinese-English question-answer pairs, which is larger than the existing instruction datasets in the IT field.
2. **Proposed a Plug-and-Play Testing Platform**:
   - It integrates all possible DBQA-involved components, such as QCR, PTE, RAG, and TIG, and supports different LLM application strategies.
3. **Conducted In-Depth Evaluations**:
   - Evaluated the end-to-end performance of nine open-source and commercial LLMs and analyzed the impact of different components (such as different RAG solutions and question-type classifiers) on performance.
   - Discovered some important insights, such as LLM performance differences, the importance of pre-training and fine-tuning, the necessity of question routing, the impact of knowledge retrieval, and the insufficiency of tool selection and invocation capabilities.

Through these contributions, the paper provides important references and guidance for future research and development of LLMs in the database question-answering field.
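To make the idea of question classification routing (QCR) more concrete, the following is a minimal Python sketch of how a question might be classified into one of the three question types named above and dispatched to a matching component. It is not the paper's implementation: the keyword classifier, the product names, the handler functions, and the mapping of product-specific questions to RAG and instance-specific questions to TIG are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical labels mirroring the three question types named in the summary.
GENERAL = "general"
PRODUCT = "product_specific"
INSTANCE = "instance_specific"

# Illustrative keyword lists (assumptions); a real QCR would likely be a trained classifier.
INSTANCE_KEYWORDS = ("my table", "my database", "my query", "our instance")
PRODUCT_KEYWORDS = ("postgresql", "mysql", "oracle")


def classify_question(question: str) -> str:
    """Toy keyword classifier standing in for a learned QCR model."""
    text = question.lower()
    # Questions about the user's own data or workload are treated as instance-specific.
    if any(keyword in text for keyword in INSTANCE_KEYWORDS):
        return INSTANCE
    # Questions that name a DBMS product are treated as product-specific.
    if any(keyword in text for keyword in PRODUCT_KEYWORDS):
        return PRODUCT
    return GENERAL


# Placeholder handlers: each returns a tagged string instead of calling a real model or tool.
def answer_general(question: str) -> str:
    return f"[LLM only] {question}"


def answer_with_rag(question: str) -> str:
    return f"[RAG over product manuals] {question}"


def answer_with_tig(question: str) -> str:
    return f"[TIG against the database instance] {question}"


# Assumed routing table: question type -> component that handles it.
ROUTES: Dict[str, Callable[[str], str]] = {
    GENERAL: answer_general,
    PRODUCT: answer_with_rag,
    INSTANCE: answer_with_tig,
}


def route(question: str) -> Tuple[str, str]:
    """Classify the question, then dispatch it to the handler for its type."""
    label = classify_question(question)
    return label, ROUTES[label](question)


if __name__ == "__main__":
    for q in (
        "What is a B-tree index?",
        "How do I enable WAL archiving in PostgreSQL?",
        "Why is my query on the orders table slow?",
    ):
        print(route(q))
```

Keeping the classifier and the routing table separate reflects the plug-and-play idea described above: either piece can be swapped out (for example, a different classifier or a different RAG handler) without touching the other, which is what makes it possible to measure each component's effect on the end-to-end result.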

Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
