Multimodal Pre-training Method for Vision-language Understanding and Generation

DOI: 10.21655/ijsi.1673-7288.00315

Abstract: Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like loss functions (masked language modeling and image-text matching) during pre-training. Although they perform well on downstream understanding tasks such as visual question answering, image-text retrieval, and visual entailment, these methods cannot perform generation. To tackle this problem, this study proposes Unified multimodal pre-training for Vision-Language understanding and generation (UniVL), which handles both understanding tasks and generation tasks. UniVL expands existing pre-training paradigms by using random masks and causal masks simultaneously. Causal masks are triangular masks that hide future tokens, so models pre-trained with them acquire autoregressive generation abilities. Moreover, several vision-language understanding tasks are turned into text generation tasks according to specifications, and a prompt-based method is employed to fine-tune the model for different downstream tasks. The experiments show that there is a trade-off between understanding tasks and generation tasks when the same model is used, and that a feasible way to improve both is to use more data. The proposed UniVL framework attains performance comparable to recent vision-language pre-training methods on both understanding tasks and generation tasks. Moreover, the prompt-based generation method is more effective and even outperforms discriminative methods in few-shot scenarios.
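
The two mask types named in the abstract can be illustrated with a minimal sketch. This is not the UniVL implementation: the function names `causal_mask` and `random_mask`, the `[MASK]` placeholder, and the 0.15 default masking probability (borrowed from common BERT practice) are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def causal_mask(n):
    """Lower-triangular visibility mask (hypothetical helper).

    Entry [i][j] is 1 if position i may attend to position j,
    which holds only when j <= i, so future tokens are hidden.
    """
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def random_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Randomly mask tokens in the style of masked language modeling.

    mask_prob=0.15 is an assumed default, not a value from the paper.
    Returns the masked sequence and the per-position targets
    (None where no prediction is required).
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)   # the model is asked to recover this token
        else:
            masked.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return masked, targets


if __name__ == "__main__":
    # Each row shows which positions a token is allowed to see.
    for row in causal_mask(4):
        print(row)
    print(random_mask(["a", "dog", "on", "grass"], mask_prob=0.5, seed=0))
```

With the causal mask, each position sees only itself and earlier positions, which is what allows left-to-right generation. With the random mask, every unmasked token stays visible in both directions, which suits understanding objectives.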

ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation

ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding

ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation

ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graphs

Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks

ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation

Multimodal Pre-training Method for Vision-language Understanding and Generation

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

UNITER: UNiversal Image-TExt Representation Learning

ERNIE: Enhanced Language Representation with Informative Entities

ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding

Multimodal Pretraining from Monolingual to Multilingual

Towards More Unified In-context Visual Understanding

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation