Abstract: The advent of large language models (LLMs) has unlocked great opportunities in complex data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing heterogeneous table structures and the potentially large scale of serialized relational data. Existing benchmarks primarily focus on single-table QA and fail to capture the intricacies of reasoning across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. To address this gap, we present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of LLMs in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets and introduces a flexible sampling mechanism to create tasks with varying multi-table context lengths, ranging from 8K to 64K tokens. To ensure robustness and reliability, we integrate symbolic extensions into the evaluation framework, enabling the assessment of LLM reasoning capabilities beyond simple data retrieval or probabilistic pattern matching. We systematically evaluate a range of LLMs, both open-source and closed-source, spanning model scales from 7 billion to 70 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments. Our benchmark implementation and results are available at https://github.com/Relaxed-System-Lab/TQA-Bench.

Enhancing Temporal Understanding in LLMs for Semi-structured Tables

Rethinking Tabular Data Understanding with Large Language Models

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Temporal Knowledge Question Answering via Abstract Reasoning Induction

On the Robustness of Language Models for Tabular Question Answering

Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science

MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models

Remember This Event That Year? Assessing Temporal Information and Reasoning in Large Language Models

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?

TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios

Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs

Large Language Models are few(1)-shot Table Reasoners

Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding

TabSQLify: Enhancing Reasoning Capabilities of LLMs Through Table Decomposition

A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting

Large Language Models Can Learn Temporal Reasoning

Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding

TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension

LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

A Survey of Table Reasoning with Large Language Models