Abstract: The advent of large language models (LLMs) has unlocked great opportunities in complex data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing heterogeneous table structures and the potentially large scale of serialized relational data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of reasoning across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. To address this gap, we present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of LLMs in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets and introduces a flexible sampling mechanism to create tasks with varying multi-table context lengths, ranging from 8K to 64K tokens. To ensure robustness and reliability, we integrate symbolic extensions into the evaluation framework, enabling the assessment of LLM reasoning capabilities beyond simple data retrieval or probabilistic pattern matching. We systematically evaluate a range of LLMs, both open-source and closed-source, spanning model scales from 7 billion to 70 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments. Our benchmark implementation and results are available at https://github.com/Relaxed-System-Lab/TQA-Bench.

Benchmarking Table Comprehension In The Wild

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension

Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

Uncovering Limitations of Large Language Models in Information Seeking from Tables

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

On the Robustness of Language Models for Tabular Question Answering

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

FinanceBench: A New Benchmark for Financial Question Answering

SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

Benchmarking Foundation Models with Language-Model-as-an-Examiner

EconLogicQA: A Question-Answering Benchmark for Evaluating Large Language Models in Economic Sequential Reasoning

Towards Benchmarking Situational Awareness of Large Language Models: Comprehensive Benchmark, Evaluation and Analysis

ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models