Abstract: Visual Information Extraction (VIE) plays a crucial role in the comprehension of semi-structured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Due to the extremely unbalanced quantity and quality of pre-training corpora between English and other languages, few works can extend to non-English scenarios. In this paper, we conduct systematic experiments to show that the vision and layout modalities are invariant across images in different languages. If language bias is decoupled from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm, LDP (Language Decoupled Pre-training), for better utilization of monolingual pre-training data. Our proposed model, LDM (Language Decoupled Model), is first pre-trained on language-independent data, where the language knowledge is decoupled by a diffusion model, and is then fine-tuned on the downstream languages. Extensive experiments show that LDM outperforms all SOTA multilingual pre-trained models and remains competitive on downstream monolingual/English benchmarks.

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is the insufficient model generalization ability in non-English scenarios in multilingual Visual Information Extraction (VIE). Specifically:

1. **Problem Background**:
   - At present, most visual information extraction models are monolingual, especially English-centric.
   - The English corpus far exceeds other languages in quantity and quality, making it difficult for existing models to be extended to non-English scenarios.
2. **Research Motivation**:
   - Researchers have found through systematic experiments that the visual and layout modalities in images are invariant across different languages.
   - If the language bias can be decoupled from document images, vision-and-layout-based models can achieve impressive cross-language generalization.
3. **Solution**:
   - A simple and effective multilingual training paradigm, LDP (Language Decoupled Pre-training), is proposed to make better use of monolingual pre-training data.
   - LDM (Language Decoupled Model) is first pre-trained on language-independent data, where language knowledge is decoupled through a diffusion model, and then fine-tuned on downstream tasks.
4. **Key Innovation Points**:
   - By decoupling language bias, the model can generalize better in multilingual scenarios.
   - The MTIM (Multi-Token Information Merging) module is introduced to integrate information from multiple bounding boxes, enhancing the model's multimodal processing ability.
   - The LKI (Language Knowledge Inserting) module is introduced in the fine-tuning stage to re-integrate the decoupled language information into downstream tasks.
5. **Experimental Verification**:
   - A large number of experiments show that LDM outperforms all existing state-of-the-art models in multilingual benchmarks, while also remaining competitive in monolingual/English benchmarks.

In summary, this paper aims to improve the generalization ability of visual information extraction models in multilingual scenarios by decoupling language bias, thereby solving the problem of poor performance of existing models in non-English scenarios.
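The premise that layout is invariant across languages can be illustrated with a toy example. The sketch below is not the paper's LDP pipeline (which decouples language from the image with a diffusion model); it only shows that once the text is dropped, two documents with the same geometry but different languages yield identical layout features. The `Token` type, the `layout_features` helper, the sample tokens, and the 0-1000 normalization grid are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical OCR token: recognized text plus a pixel bounding box."""
    text: str
    box: tuple  # (x0, y0, x1, y1) in pixels


def layout_features(tokens, page_width, page_height, grid=1000):
    """Return text-free boxes scaled to a fixed integer grid.

    The text field is deliberately ignored, so the result depends only
    on where tokens sit on the page, not on what language they are in.
    """
    feats = []
    for t in tokens:
        x0, y0, x1, y1 = t.box
        feats.append((
            x0 * grid // page_width,
            y0 * grid // page_height,
            x1 * grid // page_width,
            y1 * grid // page_height,
        ))
    return feats


# Two made-up receipts with the same geometry but different languages.
english = [Token("Total:", (50, 700, 150, 730)), Token("42.00", (400, 700, 480, 730))]
chinese = [Token("合计:", (50, 700, 150, 730)), Token("42.00", (400, 700, 480, 730))]

# The layout-only view of both documents is the same.
same = layout_features(english, 800, 1000) == layout_features(chinese, 800, 1000)
print(same)
```

A model that consumes only such language-free inputs has no direct way to overfit to English text, which is the intuition behind pre-training on language-independent data and re-inserting language knowledge at fine-tuning time.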

LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining
