OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang
2024-03-28
Abstract: Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous schemas, previous works usually design task-specific architectures and objectives for individual tasks, which inadvertently leads to modal isolation and complex workflow. In this paper, we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective: point-conditioned text generation, and the unified input & output representation: prompt & structured sequences. Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performances on 7 datasets for the three visually-situated text parsing tasks, despite its unified, concise design. The code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper proposes a unified framework, OmniParser, for three visually-situated text parsing tasks: text spotting, key information extraction (KIE), and table recognition. Specifically, it targets the following problems:

1. **Unified handling of multiple tasks**: Existing methods typically design a task-specific architecture and objective for each of the three tasks, which makes pipelines complex and hard to generalize. OmniParser handles all three with a single architecture.
2. **Processing of structured information**: Complex document images contain structured content such as text and tables. OmniParser parses this structure with a two-stage decoding strategy, which also improves the model's interpretability.
3. **Multimodal dependency and fusion**: Traditional solutions often rely on external Optical Character Recognition (OCR) engines, which limits performance and generalization. OmniParser reduces this dependency by learning structured information directly from images.
4. **Unified representation and objectives**: All tasks share one encoder-decoder architecture, one objective (point-conditioned text generation), and one input-output representation (prompts and structured sequences).

In summary, the core contribution is a single framework that handles text spotting, KIE, and table recognition together. By simplifying the model structure and reducing task isolation, it achieves state-of-the-art or highly competitive results on 7 datasets across the three tasks.
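
To make the idea of a prompt plus a structured point sequence more concrete, here is a minimal Python sketch of one plausible reading of the representation summarized above. It is not the authors' implementation: the prompt strings in `TASK_PROMPTS`, the `<x_…>`/`<y_…>` token format, the `NUM_BINS` quantization resolution, and both function names are assumptions chosen only for illustration. The first function serializes a task prompt and text-instance center points into one token sequence (a stand-in for the first decoding stage); the second splits that sequence into per-point prompts on which a text decoder could be conditioned (a stand-in for the second stage).

```python
from typing import Dict, List, Tuple

# Assumed quantization resolution; the summary above does not specify one.
NUM_BINS = 1000

# Hypothetical task prompt tokens, one per task, so a single decoder
# can be told which kind of structured sequence to produce.
TASK_PROMPTS: Dict[str, str] = {
    "text_spotting": "<S_TS>",
    "kie": "<S_KIE>",
    "table": "<S_TR>",
}


def quantize(value: float, size: float, bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, bins - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)
    return min(int(ratio * bins), bins - 1)


def build_point_sequence(
    task: str, points: List[Tuple[float, float]], width: float, height: float
) -> List[str]:
    """Serialize a task prompt and center points into one token sequence."""
    tokens = [TASK_PROMPTS[task]]
    for x, y in points:
        tokens.append(f"<x_{quantize(x, width)}>")
        tokens.append(f"<y_{quantize(y, height)}>")
    tokens.append("</S>")
    return tokens


def point_conditioned_prompts(point_tokens: List[str]) -> List[List[str]]:
    """Split a point sequence into one (x, y) prompt per text instance."""
    coords = point_tokens[1:-1]  # drop the task prompt and the end token
    return [coords[i:i + 2] for i in range(0, len(coords), 2)]


if __name__ == "__main__":
    seq = build_point_sequence("text_spotting", [(50, 20), (100, 40)], 100, 40)
    print(seq)
    print(point_conditioned_prompts(seq))
```

In this sketch, switching tasks only changes the leading prompt token, while the sequence format and the per-point conditioning stay the same, which is the property the unified design relies on.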