Abstract: Tables contain factual and quantitative data accompanied by various structures and contents that pose challenges for machine comprehension. Previous methods generally design task-specific architectures and objectives for individual tasks, resulting in modal isolation and intricate workflows. In this paper, we present a novel large vision-language model, TabPedia, equipped with a concept synergy mechanism. In this mechanism, all the involved diverse visual table understanding (VTU) tasks and multi-source visual embeddings are abstracted as concepts. This unified framework allows TabPedia to seamlessly integrate VTU tasks, such as table detection, table structure recognition, table querying, and table question answering, by leveraging the capabilities of large language models (LLMs). Moreover, the concept synergy mechanism enables table perception-related and comprehension-related tasks to work in harmony, as they can effectively leverage the needed clues from the corresponding source perception embeddings. Furthermore, to better evaluate the VTU task in real-world scenarios, we establish a new and comprehensive table VQA benchmark, ComTQA, featuring approximately 9,000 QA pairs. Extensive quantitative and qualitative experiments on both table perception and comprehension tasks, conducted across various public benchmarks, validate the effectiveness of TabPedia. The superior performance further confirms the feasibility of using LLMs for understanding visual tables when all concepts work in synergy. The benchmark ComTQA has been open-sourced at https://huggingface.co/datasets/ByteDance/ComTQA. The source code and model have also been released at https://github.com/zhaowc-ustc/TabPedia.

What problem does this paper attempt to address?

This paper addresses the limitations of existing methods for Visual Table Understanding (VTU) tasks. Specifically:

1. **Limitations of task-specific architectures**: Previous methods usually design a separate architecture and objective for each VTU subtask, which leads to modal isolation and intricate workflows. For example, table detection, table structure recognition, table querying, and table question answering are handled independently and lack a unified framework.
2. **Challenges in multi-modal information integration**: VTU tasks need visual-semantic representations at different granularities and levels, which strains a model's ability to integrate multi-modal information. Although large vision-language models (LVLMs) have made significant progress in visual understanding, they still struggle to parse and understand two-dimensional tables.
3. **Insufficient evaluation in real-world scenarios**: Existing VTU benchmarks do not fully reflect real-world complexity and diversity, especially for tasks that involve understanding and reasoning over table content.

To address these problems, the authors propose TabPedia, a new large vision-language model, whose main contributions include:

- **Unified framework**: With the concept synergy mechanism, all VTU tasks and multi-source visual embeddings are abstracted as concepts, so that VTU tasks are integrated seamlessly.
- **Concept synergy mechanism**: By introducing meditative tokens, perception and comprehension tasks work in synergy, making more effective use of the information in multi-source visual embeddings and task instructions.
- **New benchmark dataset**: A new comprehensive table VQA benchmark, ComTQA, containing approximately 9,000 question-answer pairs, is constructed to evaluate VTU performance in real-world scenarios more thoroughly.

Through these innovations, TabPedia demonstrates its effectiveness on multiple public benchmarks and shows potential in more complex and realistic tasks.

### Formula Examples

The following are some formula examples that may be involved in the paper (assuming relevant formulas exist):

- **Loss function**: \[ L = -\sum_{i = 1}^{N}\log P(y_i|x_i) \] where \(N\) is the number of samples, and \(P(y_i|x_i)\) is the predicted probability of the target \(y_i\) given the input \(x_i\).
- **Tree Edit Distance Similarity (TEDS)**: \[ \text{TEDS}(T_1, T_2)=1-\frac{\text{TED}(T_1, T_2)}{\max(\text{size}(T_1),\text{size}(T_2))} \] where \(T_1\) and \(T_2\) are two HTML table trees, \(\text{TED}(T_1, T_2)\) is the tree edit distance between them, and \(\text{size}(T)\) is the number of nodes in \(T\).
- **GriTS metric**: \[ \text{GriTS}_f(A, B)=\frac{2\sum_{i,j} f(\tilde{A}_{i,j},\tilde{B}_{i,j})}{|A|+|B|} \] where \(A\) and \(B\) are the ground-truth and predicted cell grids, \(\tilde{A}\) and \(\tilde{B}\) are their most similar two-dimensional substructures, and \(f\) is a cell similarity function. Choosing \(f\) to compare cell topology, content, or location gives three separate scores, \(\text{GriTS}_{\text{Top}}\), \(\text{GriTS}_{\text{Con}}\), and \(\text{GriTS}_{\text{Loc}}\).

These formulas help explain how the model is trained and evaluated.
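
As a rough illustration of the loss and TEDS formulas above, the following Python sketch evaluates both on toy inputs. It is not TabPedia's code: the `Node` class, the function names, and the example trees are hypothetical, and the tree edit distance is supplied by hand rather than computed, since a real evaluation would obtain it from a tree edit distance algorithm.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal HTML tree node: a tag plus child nodes."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def teds(ted: float, t1: Node, t2: Node) -> float:
    """TEDS = 1 - TED / max(size(T1), size(T2)).

    `ted` is assumed to be computed elsewhere by a tree edit distance algorithm.
    """
    return 1.0 - ted / max(tree_size(t1), tree_size(t2))


def nll_loss(probs: List[float]) -> float:
    """L = -sum_i log P(y_i | x_i), given the probability assigned to each target."""
    return -sum(math.log(p) for p in probs)


# Toy example: ground truth has one row with two cells, prediction has one cell.
truth = Node("table", [Node("tr", [Node("td"), Node("td")])])  # 4 nodes
pred = Node("table", [Node("tr", [Node("td")])])               # 3 nodes

# Deleting one <td> turns `truth` into `pred`, so the edit distance is 1
# (worked out by hand for this example, not computed here).
print(teds(1, truth, pred))               # 0.75
print(round(nll_loss([0.5, 0.25]), 4))    # 2.0794
```

A TEDS of 1.0 means the two trees are identical, and the score falls toward 0 as more edits are needed relative to the larger tree.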

TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
