Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents

MENGJUN CHENG,Chengquan Zhang,Chang Liu,Yuke Li,Bohan Li,Kun Yao,Xiawu Zheng,Rongrong Ji,Jie Chen
DOI: https://doi.org/10.1007/978-3-031-72995-9_27
2024-01-01
Abstract:Current methodologies have achieved notable success in the closed-set visual information extraction task, while the exploration into open-vocabulary settings is comparatively underdeveloped, which is practical for individual users in terms of inferring information across documents of diverse types. Existing proposal solutions including NER methods and LLM-based methods fall short in processing the unlimited range of open-vocabulary keys and missing explicit layout modeling. This paper introduces a novel method for tackling the given challenge by transforming the process of categorizing text tokens into a task of locating regions based on given queries also called textual grounding. Particularly, we take this a step further by pairing open-vocabulary key language embedding with corresponding grounded text visual embedding. We design a document-tailored grounding framework by incorporating the layout-aware context learning and document-tailored two-stage pre-training, which significantly improves the model's understanding of documents. Our method outperforms current proposal solutions on the SVRD benchmark for the open-vocabulary VIE task, offering lower costs and faster inference speed. Specifically, our method infers 20x faster than the QwenVL model and achieves an improvement of about 24.3% for the F-score.
What problem does this paper attempt to address?