LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining

Huawen Shen, Gengluo Li, Jinwen Zhong, Yu Zhou
2024-12-19
Abstract: Visual Information Extraction (VIE) plays a crucial role in the comprehension of semi-structured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Due to the extremely unbalanced quantity and quality of pre-training corpora between English and other languages, few works can extend to non-English scenarios. In this paper, we conduct systematic experiments to show that the vision and layout modalities are invariant across images in different languages. If language bias is decoupled from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm, LDP (Language Decoupled Pre-training), for better utilization of monolingual pre-training data. Our proposed model, LDM (Language Decoupled Model), is first pre-trained on language-independent data, where the language knowledge is decoupled by a diffusion model, and is then fine-tuned on the downstream languages. Extensive experiments show that LDM outperforms all SOTA multilingual pre-trained models and also remains competitive on downstream monolingual/English benchmarks.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The main problem this paper attempts to solve is the insufficient generalization of models to non-English scenarios in multilingual Visual Information Extraction (VIE). Specifically:

1. **Problem Background**:
   - At present, most visual information extraction models are monolingual, especially English-centric.
   - The English corpus far exceeds other languages in quantity and quality, making it difficult to extend existing models to non-English scenarios.
2. **Research Motivation**:
   - Through systematic experiments, the researchers found that the vision and layout modalities in images are invariant across different languages.
   - If the language bias can be decoupled from document images, vision-and-layout-based models can achieve impressive cross-lingual generalization.
3. **Solution**:
   - A simple and effective multilingual training paradigm, LDP (Language Decoupled Pre-training), is proposed to make better use of monolingual pre-training data.
   - LDM (Language Decoupled Model) is first pre-trained on language-independent data, where language knowledge is decoupled through a diffusion model, and then fine-tuned on downstream tasks.
4. **Key Innovation Points**:
   - By decoupling language bias, the model can generalize better in multilingual scenarios.
   - The MTIM (Multi-Token Information Merging) module is introduced to integrate information from multiple bounding boxes, enhancing the model's multimodal processing ability.
   - The LKI (Language Knowledge Inserting) module is introduced in the fine-tuning stage to re-integrate the decoupled language information into downstream tasks.
5. **Experimental Verification**:
   - Extensive experiments show that LDM outperforms all existing state-of-the-art models on multilingual benchmarks, while also remaining competitive on monolingual/English benchmarks.
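The summary above does not say how MTIM combines information, so the following is only a generic illustration of merging per-token features into one feature per bounding box. It is not the paper's module. The function name `merge_tokens_by_box`, the mean-pooling rule, and the `(x0, y0, x1, y1)` box format are all assumptions made for this sketch.

```python
from collections import defaultdict


def merge_tokens_by_box(token_features, token_boxes):
    """Mean-pool token feature vectors that share the same bounding box.

    Hypothetical helper for illustration only; mean pooling is an assumed
    merging rule, not the MTIM module described in the paper.

    token_features: list of equal-length lists of floats, one per token.
    token_boxes:    list of (x0, y0, x1, y1) tuples, one per token.
    Returns a dict mapping each box to its averaged feature vector.
    """
    if len(token_features) != len(token_boxes):
        raise ValueError("token_features and token_boxes must have equal length")

    # Group the feature vectors of all tokens that fall in the same box.
    grouped = defaultdict(list)
    for feature, box in zip(token_features, token_boxes):
        grouped[tuple(box)].append(feature)

    # Average each group dimension by dimension.
    merged = {}
    for box, features in grouped.items():
        dim = len(features[0])
        merged[box] = [
            sum(f[i] for f in features) / len(features) for i in range(dim)
        ]
    return merged


# Example: two tokens share the first box, one token sits in the second box.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 0, 30, 10)]
merged = merge_tokens_by_box(features, boxes)
```

Because the merged representation is keyed by box geometry rather than by token text, a sketch like this fits the vision-and-layout view in the summary, where the features should not depend on which language the tokens are written in.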
In summary, this paper aims to improve the generalization ability of visual information extraction models in multilingual scenarios by decoupling language bias, thereby solving the problem of poor performance of existing models in non-English scenarios.