TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy

Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shu Wei, Binghong Wu, Lei Liao, Yongjie Ye, Hao Liu, Wengang Zhou, Houqiang Li, Can Huang
2024-10-11
Abstract: Tables contain factual and quantitative data accompanied by various structures and contents that pose challenges for machine comprehension. Previous methods generally design task-specific architectures and objectives for individual tasks, resulting in modal isolation and intricate workflows. In this paper, we present a novel large vision-language model, TabPedia, equipped with a concept synergy mechanism. In this mechanism, all the involved diverse visual table understanding (VTU) tasks and multi-source visual embeddings are abstracted as concepts. This unified framework allows TabPedia to seamlessly integrate VTU tasks, such as table detection, table structure recognition, table querying, and table question answering, by leveraging the capabilities of large language models (LLMs). Moreover, the concept synergy mechanism enables table perception-related and comprehension-related tasks to work in harmony, as they can effectively leverage the needed clues from the corresponding source perception embeddings. Furthermore, to better evaluate the VTU task in real-world scenarios, we establish a new and comprehensive table VQA benchmark, ComTQA, featuring approximately 9,000 QA pairs. Extensive quantitative and qualitative experiments on both table perception and comprehension tasks, conducted across various public benchmarks, validate the effectiveness of our TabPedia. The superior performance further confirms the feasibility of using LLMs for understanding visual tables when all concepts work in synergy. The benchmark ComTQA has been open-sourced at https://huggingface.co/datasets/ByteDance/ComTQA. The source code and model have also been released at https://github.com/zhaowc-ustc/TabPedia.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limitations of existing methods for Visual Table Understanding (VTU) tasks. Specifically:

1. **Limitations of task-specific architectures**: Previous methods usually design a separate architecture and objective for each VTU subtask, which leads to modal isolation and intricate workflows. For example, table detection, table structure recognition, table querying, and table question answering are handled independently, with no unified framework.
2. **Challenges in multi-modal information integration**: VTU tasks must handle visual-semantic representations at different granularities and levels, which challenges a model's ability to integrate multi-modal information. Although large vision-language models (LVLMs) have made significant progress in visual understanding, they still struggle to parse and understand two-dimensional tables.
3. **Insufficient evaluation in real-world scenarios**: Existing VTU benchmarks do not fully reflect real-world complexity and diversity, especially for tasks that involve understanding and reasoning over table content.

To address these problems, the authors propose a new large vision-language model, TabPedia, whose main contributions include:

- **Unified framework**: Through the concept synergy mechanism, all VTU tasks and multi-source visual embeddings are abstracted as concepts, which allows the VTU tasks to be integrated seamlessly.
- **Concept synergy mechanism**: By introducing meditative tokens, perception and comprehension tasks can work in synergy and make more effective use of the information in the multi-source visual embeddings and task instructions.
- **New benchmark dataset**: A new comprehensive table VQA benchmark, ComTQA, containing approximately 9,000 question-answer pairs, is constructed to evaluate VTU performance in real-world scenarios more thoroughly.
Through these innovations, TabPedia demonstrates its effectiveness on multiple public benchmarks and shows potential on more complex and realistic tasks.

### Formula Examples

The following formulas illustrate the kind of training objective and evaluation metrics that may be involved in the paper (assuming relevant formulas exist):

- **Loss function**: \[ L = -\sum_{i = 1}^{N}\log P(y_i \mid x_i) \] where \(N\) is the number of samples, and \(P(y_i \mid x_i)\) is the predicted probability of the target \(y_i\) given the input \(x_i\).
- **Tree Edit Distance Similarity (TEDS)**: \[ \text{TEDS}(T_1, T_2)=1-\frac{\text{TED}(T_1, T_2)}{\max(\text{size}(T_1),\text{size}(T_2))} \] where \(T_1\) and \(T_2\) are two HTML table trees, \(\text{TED}(T_1, T_2)\) is the tree edit distance between them, and \(\text{size}(T)\) is the number of nodes in \(T\).
- **GriTS metric**: \[ \text{GriTS}_f(A, B)=\frac{2\sum_{i,j} f(\tilde{A}_{i,j}, \tilde{B}_{i,j})}{|A|+|B|} \] where \(A\) and \(B\) are the ground-truth and predicted cell grids, \(\tilde{A}\) and \(\tilde{B}\) are their most similar two-dimensional substructures, and \(f\) is a cell-level similarity function. Choosing \(f\) for cell topology, content, or location yields \(\text{GriTS}_{\text{Top}}\), \(\text{GriTS}_{\text{Con}}\), and \(\text{GriTS}_{\text{Loc}}\) respectively; the three scores are reported separately rather than combined into a weighted sum.

These formulas illustrate the training and evaluation procedures that a model of this kind typically relies on.
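To make the TEDS definition more concrete, the following is a minimal Python sketch of the score on toy table trees. It is not the paper's evaluation code: the nested-tuple tree encoding, the helper names (`node`, `forest_size`, `forest_distance`, `teds`), and the unit cost for every insert, delete, and relabel are all assumptions made for illustration. The commonly used TEDS implementation additionally scores cell text with a normalized string edit distance, which is omitted here, so this sketch is closer to a structure-only variant.

```python
from functools import lru_cache


def node(label, *children):
    """Build a hashable tree node: (label, tuple_of_children). Hypothetical helper."""
    return (label, tuple(children))


def forest_size(forest):
    """Count every node in a forest (a tuple of trees)."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests.

    Uses the classic rightmost-root recursion: delete the rightmost root of
    f1, insert the rightmost root of f2, or match the two roots (relabeling
    if their labels differ) and recurse on children and remaining siblings.
    """
    if not f1:
        return forest_size(f2)  # insert everything in f2
    if not f2:
        return forest_size(f1)  # delete everything in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    match = (
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (0 if label1 == label2 else 1)
    )
    return min(delete, insert, match)


def teds(t1, t2):
    """TEDS = 1 - TED(T1, T2) / max(size(T1), size(T2))."""
    ted = forest_distance((t1,), (t2,))
    return 1.0 - ted / max(forest_size((t1,)), forest_size((t2,)))


# Toy example: a 2x2 table versus a prediction that misses one cell.
truth = node("table",
             node("tr", node("td"), node("td")),
             node("tr", node("td"), node("td")))
pred = node("table",
            node("tr", node("td"), node("td")),
            node("tr", node("td")))

print(teds(truth, truth))  # identical trees
print(teds(truth, pred))   # one missing <td>
```

In the toy example the ground-truth tree has 7 nodes and the prediction differs by a single deleted `td`, so the score is 1 - 1/7. The plain memoized recursion is only suitable for small trees; for real tables, an optimized algorithm such as Zhang-Shasha would be the more usual choice.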