TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension

Zipeng Qiu, You Peng, Guangxin He, Binhang Yuan, Chen Wang
2024-11-29
Abstract: The advent of large language models (LLMs) has unlocked great opportunities in complex data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing heterogeneous table structures and the potentially large scale of serialized relational data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of reasoning across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. To address this gap, we present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of LLMs in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets and introduces a flexible sampling mechanism to create tasks with varying multi-table context lengths, ranging from 8K to 64K tokens. To ensure robustness and reliability, we integrate symbolic extensions into the evaluation framework, enabling the assessment of LLM reasoning capabilities beyond simple data retrieval or probabilistic pattern matching. We systematically evaluate a range of LLMs, both open-source and closed-source, spanning model scales from 7 billion to 70 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments. Our benchmark implementation and results are available at https://github.com/Relaxed-System-Lab/TQA-Bench.
Artificial Intelligence, Computation and Language, Information Retrieval
### What problems does this paper attempt to solve?

This paper addresses the challenge of systematically evaluating large language models (LLMs) on multi-table question answering (multi-table QA). Existing benchmarks mainly focus on single-table QA and fail to capture the complexity of reasoning across multiple relational tables, which is a common requirement in practical domains such as finance, healthcare, and e-commerce.

#### Main problems include:

1. **Limitations of single-table benchmarks**: Most existing table QA benchmarks are built around a single-table context and cannot reflect the complexity of multi-table structures.
2. **Insufficient data scale and heterogeneity**: The tables in existing benchmarks are usually small and lack the heterogeneity and diversity of real-world data.
3. **Evaluation reliability issues**: A fixed set of questions may let models succeed through probabilistic pattern matching rather than by demonstrating strong generalization.

To solve these problems, the authors propose a new multi-table QA benchmark, **TQA-Bench**. It improves on existing benchmarks in the following ways:

- **Diverse multi-table data sources**: Multi-table relational database instances are collected from real-world public datasets to ensure the authenticity and complexity of the data.
- **Flexible sampling mechanism**: Tasks are created with different multi-table context lengths, ranging from 8K to 64K tokens, to evaluate how well LLMs handle data of different scales.
- **Symbolic extension**: Symbolic extensions are introduced to evaluate the reasoning abilities of LLMs, rather than just simple data retrieval or probabilistic pattern matching.
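The idea behind a symbolic extension can be illustrated with a small sketch. This is not the TQA-Bench implementation: the toy tables, the question template, and the helper names `make_question` and `sample_questions` are all hypothetical, chosen only to show how a question with a variable slot can have its ground truth recomputed from the tables instead of being stored as a fixed string.

```python
import random

# Hypothetical toy tables linked by customer_id (not taken from TQA-Bench).
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 30.0},
    {"order_id": 11, "customer_id": 1, "amount": 12.5},
    {"order_id": 12, "customer_id": 2, "amount": 99.0},
]


def make_question(name):
    """Fill the template with `name` and compute the answer from the tables."""
    question = f"What is the total order amount for the customer named {name}?"
    # Join customers to orders on customer_id, then aggregate.
    ids = {c["customer_id"] for c in customers if c["name"] == name}
    answer = sum(o["amount"] for o in orders if o["customer_id"] in ids)
    return question, answer


def sample_questions(n, seed=0):
    """Draw n template instantiations with a seeded RNG for reproducibility."""
    rng = random.Random(seed)
    names = [c["name"] for c in customers]
    return [make_question(rng.choice(names)) for _ in range(n)]


if __name__ == "__main__":
    for question, answer in sample_questions(3):
        print(question, "->", answer)
```

Because the answer is derived from the data each time, changing the variable (or the sampled rows) yields a new question with a correct label, which makes it harder to score well by memorizing fixed question-answer pairs.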
Through these improvements, TQA-Bench evaluates LLMs on complex multi-table QA tasks more comprehensively and provides guidance for future research and applications.

### Summary

This paper addresses the limitations of existing table QA benchmarks, which are largely single-table, and proposes a new benchmark framework, TQA-Bench, to more accurately evaluate the performance of LLMs on complex multi-table data.