Multimodal Table Understanding

Mingyu Zheng,Xinwei Feng,Qingyi Si,Qiaoqiao She,Zheng Lin,Wenbin Jiang,Weiping Wang

2024-06-12

Abstract: Although great progress has been made by previous table understanding methods including recent approaches based on large language models (LLMs), they rely heavily on the premise that given tables must be converted into a certain text sequence (such as Markdown or HTML) to serve as model input. However, it is difficult to access such high-quality textual table representations in some real-world scenarios, and table images are much more accessible. Therefore, how to directly understand tables using intuitive visual information is a crucial and urgent challenge for developing more practical applications. In this paper, we propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests based on the given table image. To facilitate both the model training and evaluation, we construct a large-scale dataset named MMTab, which covers a wide spectrum of table images, instructions and tasks. On this basis, we develop Table-LLaVA, a generalist tabular multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks under held-in and held-out settings. The code and data are available at https://github.com/SpursGoZmy/Table-LLaVA

Computation and Language,Artificial Intelligence

What problem does this paper attempt to address?

The paper mainly addresses the following issues:

1. **Direct Understanding of Table Images**: Existing table understanding methods mostly rely on converting tables into specific text sequences (such as Markdown or HTML) as input, which can be challenging in practical scenarios, especially when dealing with scanned documents or web screenshots. Therefore, the paper proposes a new problem setting, Multimodal Table Understanding, aiming to enable models to generate correct responses directly based on table images.
2. **Constructing a Comprehensive Dataset**: To promote the development of multimodal table understanding, the authors have constructed a large-scale dataset called MMTab. This dataset covers a diverse range of table images, instructions, and tasks, supporting model training and evaluation. MMTab not only includes common table understanding tasks but also introduces some novel table structure understanding tasks.
3. **Developing a General Multimodal Large Language Model**: Based on the MMTab dataset, the authors have developed a general table multimodal large language model named Table-LLaVA. This model, through a two-stage training paradigm, significantly outperforms existing open-source multimodal large language models on a series of benchmark tests and approaches the performance of powerful closed-source models like GPT-4V on certain tasks.

In summary, the main contributions of this paper are the proposal of a new problem (multimodal table understanding), the construction of a comprehensive dataset for this purpose, and the development of a high-performance multimodal model to advance research in this field.
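To make the text-sequence premise concrete, the sketch below serializes one small table into the two formats the abstract names, Markdown and HTML. It is an illustration only and is not taken from the paper's code: the function names `to_markdown` and `to_html` and the toy table are hypothetical, and a real pipeline would also have to handle merged cells, nested headers, and similar structure.

```python
from html import escape


def to_markdown(header, rows):
    """Serialize a flat table as a Markdown pipe table (hypothetical helper)."""
    lines = [
        "| " + " | ".join(str(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)


def to_html(header, rows):
    """Serialize the same flat table as a minimal HTML table (hypothetical helper)."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Toy data, invented purely for illustration.
header = ["Item", "Qty"]
rows = [["apple", 3], ["pear", 5]]

markdown_text = to_markdown(header, rows)
html_text = to_html(header, rows)
```

Both strings can only be produced when the cell contents are already available as structured data. For a scanned document or a screenshot, that structured form is exactly what is missing, which is the gap the image-based setting is meant to close.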

TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

TableVLM: Multi-modal Pre-training for Table Structure Recognition

Large Language Model for Table Processing: A Survey

TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

MULTI: Multimodal Understanding Leaderboard with Text and Images

TableLlama: Towards Open Large Generalist Models for Tables

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

TableRAG: Million-Token Table Understanding with Language Models

Rethinking Tabular Data Understanding with Large Language Models

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

End-to-End Compound Table Understanding with Multi-Modal Modeling

MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science

Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding

MIBench: Evaluating Multimodal Large Language Models over Multiple Images