TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Guanglin Niu, Tongliang Li, Zhoujun Li

2024-08-17

Abstract: Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark, TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving performance comparable to GPT-3.5. Extensive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the shortcomings of existing large language models (LLMs) in handling real-world tabular data, especially in complex reasoning tasks. Although recent advances have significantly improved LLMs' ability to understand and process tabular data, the models still face challenges in industrial application scenarios, mainly in three respects:

1. **Gap between academic benchmarks and real-world applications**: Existing academic benchmarks fail to fully reflect the complexity and diversity of industrial scenarios, so strong benchmark results do not carry over to real environments.
2. **Complex reasoning requirements**: Tabular data in industrial scenarios often requires multi-step reasoning, and current LLMs have limited capabilities in this regard.
3. **Diverse task requirements**: Applications of tabular data span fact verification, numerical reasoning, data analysis, and visualization, and existing models perform unevenly across these tasks.

To address these issues, the authors propose TableBench, a comprehensive and complex benchmark for evaluating and improving LLMs' capabilities in handling real-world tabular data. TableBench includes 18 subcategories covering four main areas of table question answering (TableQA) capability: fact verification, numerical reasoning, data analysis, and visualization. With this benchmark, the authors hope to bridge the gap between academic research and industrial applications and to promote further development of LLMs in practical scenarios.
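To make "multi-step reasoning" over a table concrete, here is a minimal Python sketch. It is only an illustration: the toy table, the question, and every function name are invented for this example and are not taken from TableBench or TableInstruct. The question "Which region grew fastest from 2022 to 2023?" cannot be answered by reading a single cell; it needs a lookup, a derived calculation, and a comparison.

```python
# Toy table, invented for illustration (not TableBench data).
table = [
    {"region": "North", "year": 2022, "revenue": 120.0},
    {"region": "North", "year": 2023, "revenue": 150.0},
    {"region": "South", "year": 2022, "revenue": 200.0},
    {"region": "South", "year": 2023, "revenue": 210.0},
]


def revenue(rows, region, year):
    """Step 1: look up a single cell by its row keys."""
    for row in rows:
        if row["region"] == region and row["year"] == year:
            return row["revenue"]
    raise KeyError((region, year))


def growth_rate(rows, region, start, end):
    """Step 2: combine two looked-up cells into a derived value."""
    before = revenue(rows, region, start)
    after = revenue(rows, region, end)
    return (after - before) / before


def fastest_growing_region(rows, start, end):
    """Step 3: compare the derived values across all regions."""
    regions = sorted({row["region"] for row in rows})
    return max(regions, key=lambda r: growth_rate(rows, r, start, end))


def verify_claim(rows, region, start, end, threshold):
    """Fact-verification style check: did growth exceed the threshold?"""
    return growth_rate(rows, region, start, end) > threshold


# Numerical-reasoning question: which region grew fastest?
answer = fastest_growing_region(table, 2022, 2023)
# Fact-verification question: "North grew by more than 20%."
claim_holds = verify_claim(table, "North", 2022, 2023, 0.20)
```

The same derived quantity (the growth rate) feeds both a numerical-reasoning question and a fact-verification question, which is one way to see why a model that handles single-cell lookups can still fail on questions of this kind.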

Benchmarking Table Comprehension In The Wild

TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries

On the Robustness of Language Models for Tabular Question Answering

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios

Uncovering Limitations of Large Language Models in Information Seeking from Tables

CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models

Rethinking Tabular Data Understanding with Large Language Models

TableGPT2: A Large Multimodal Model with Tabular Data Integration

CABINET: Content Relevance based Noise Reduction for Table Question Answering

NLPBench: Evaluating Large Language Models on Solving NLP Problems

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

CLR-Bench: Evaluating Large Language Models in College-level Reasoning