Abstract: The advent of large language models (LLMs) has unlocked great opportunities in complex data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing heterogeneous table structures and the potentially large scale of serialized relational data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of reasoning across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. To address this gap, we present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of LLMs in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets and introduces a flexible sampling mechanism to create tasks with varying multi-table context lengths, ranging from 8K to 64K tokens. To ensure robustness and reliability, we integrate symbolic extensions into the evaluation framework, enabling the assessment of LLM reasoning capabilities beyond simple data retrieval or probabilistic pattern matching. We systematically evaluate a range of LLMs, both open-source and closed-source, spanning model scales from 7 billion to 70 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments. Our benchmark implementation and results are available at https://github.com/Relaxed-System-Lab/TQA-Bench.

What problem does this paper attempt to address?

This paper addresses the challenge of systematically evaluating large language models (LLMs) on multi-table question answering (multi-table QA). Existing benchmarks mainly target single-table QA and fail to capture the complexity of reasoning across multiple relational tables, which is a common requirement in practical domains such as finance, healthcare, and e-commerce.

#### Main problems

1. **Limitations of single-table benchmarks**: Most existing table QA benchmarks are built around a single-table context and cannot reflect the complexity of multi-table structures.
2. **Insufficient data scale and heterogeneity**: The tables in existing benchmarks are usually small and lack the heterogeneity and diversity of real-world data.
3. **Evaluation reliability issues**: Evaluating with a fixed set of questions may reward probabilistic pattern matching on those specific questions rather than genuine, generalizable reasoning.

To address these problems, the authors propose a new multi-table QA benchmark, **TQA-Bench**. It improves on existing benchmarks in three ways:

- **Diverse multi-table data sources**: Multi-table relational database instances are collected from real-world public datasets to ensure the authenticity and complexity of the data.
- **Flexible sampling mechanism**: Tasks are created with different multi-table context lengths, ranging from 8K to 64K tokens, to evaluate how well LLMs handle data of different scales.
- **Symbolic extension**: Symbolic extensions are integrated into the evaluation to assess LLM reasoning abilities rather than simple data retrieval or probabilistic pattern matching.
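The flexible sampling mechanism can be illustrated with a minimal sketch. The text here does not describe how TQA-Bench actually samples, so everything below is an assumption for illustration: the hypothetical `sample_to_budget` helper, the two-table parent/child layout, CSV serialization, and the rough 4-characters-per-token estimate. The idea shown is to add parent rows together with the child rows that reference them until a token budget is reached, so that foreign keys in the sampled context stay valid.

```python
import csv
import io
import random


def serialize(rows):
    """Serialize a list of dict rows to CSV text (header plus data rows)."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def approx_tokens(text):
    """Crude token estimate: about 4 characters per token (an assumption)."""
    return len(text) // 4


def sample_to_budget(parents, children, fk, pk, budget, seed=0):
    """Keep a random subset of parent rows, plus every child row that
    references them, such that the serialized tables fit in `budget` tokens.

    Hypothetical helper: not the benchmark's actual sampling procedure.
    """
    rng = random.Random(seed)
    order = parents[:]
    rng.shuffle(order)
    kept_parents, kept_children = [], []
    for parent in order:
        linked = [c for c in children if c[fk] == parent[pk]]
        trial_parents = kept_parents + [parent]
        trial_children = kept_children + linked
        size = (approx_tokens(serialize(trial_parents))
                + approx_tokens(serialize(trial_children)))
        if size > budget:
            break  # adding this parent would exceed the budget
        kept_parents, kept_children = trial_parents, trial_children
    return kept_parents, kept_children
```

Calling the helper with a larger `budget` on the same database yields a longer context for the same schema, which is one way to produce task variants of different lengths. A real implementation would use the target model's tokenizer instead of the character-based estimate.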
Through these improvements, TQA-Bench can evaluate LLMs on complex multi-table QA tasks more comprehensively and provide guidance for future research and applications.

### Summary

The paper identifies the limitations of existing table QA benchmarks, which are largely single-table, and proposes a new benchmark, TQA-Bench, to evaluate more accurately how large language models handle complex multi-table data.
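The symbolic extension mentioned among the benchmark's features is not specified in detail here. One plausible reading, assumed for this sketch, is that each question is a template with variables and its ground-truth answer is computed by a program over the tables, so fresh question instances can be generated instead of reusing a fixed set. The schema, the `QUESTION` template, `ANSWER_SQL`, and the `instantiate` helper below are all hypothetical.

```python
import random
import sqlite3


def build_db():
    """Create a tiny in-memory two-table database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'Paris'), (2, 'Tokyo'), (3, 'Paris');
        INSERT INTO orders VALUES
            (1, 1, 20.0), (2, 2, 35.0), (3, 3, 15.0), (4, 1, 5.0);
    """)
    return conn


# Question template with one symbolic variable, {city}.
QUESTION = "What is the total order amount for customers in {city}?"

# The ground truth is computed from the data, not stored as a fixed string.
ANSWER_SQL = """
    SELECT COALESCE(SUM(o.amount), 0)
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    WHERE c.city = ?
"""


def instantiate(conn, rng):
    """Bind the variable to a value drawn from the data and compute the answer."""
    cities = [row[0] for row in conn.execute(
        "SELECT DISTINCT city FROM customers ORDER BY city")]
    city = rng.choice(cities)
    answer = conn.execute(ANSWER_SQL, (city,)).fetchone()[0]
    return QUESTION.format(city=city), answer
```

Because the answer is recomputed for every binding, a model that has memorized one question-answer pair gains nothing on the next instance; it has to join the two tables and aggregate each time.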

TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension
