Abstract: Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Unlike previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.
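
The conditional masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `conditional_mask`, the 15% masking probability, the 50/50 choice between the two tasks, and the use of `None` as a stand-in for zeroed region features are all assumptions made for illustration. The only property taken from the abstract is that, in any one step, a single modality is masked while the other stays fully observed.

```python
import random

MASK = "[MASK]"  # placeholder token for a masked word (assumed name)


def conditional_mask(words, regions, mask_prob=0.15, rng=None):
    """Mask exactly one modality and leave the other fully observed.

    `words` and `regions` are assumed to be non-empty sequences.
    Returns (task, masked_words, masked_regions, masked_indices).
    """
    rng = rng or random.Random()
    words, regions = list(words), list(regions)  # copy, do not mutate inputs

    # Hypothetical 50/50 choice between the two masked-modeling tasks.
    if rng.random() < 0.5:
        # Masked language modeling: all image regions stay visible.
        idx = [i for i in range(len(words)) if rng.random() < mask_prob]
        if not idx:  # make sure at least one position is masked
            idx = [rng.randrange(len(words))]
        for i in idx:
            words[i] = MASK
        return "mlm", words, regions, idx

    # Masked region modeling: all words stay visible.
    idx = [i for i in range(len(regions)) if rng.random() < mask_prob]
    if not idx:
        idx = [rng.randrange(len(regions))]
    for i in idx:
        regions[i] = None  # stands in for zeroing out the region feature
    return "mrm", words, regions, idx


if __name__ == "__main__":
    task, w, r, idx = conditional_mask(
        ["a", "dog", "on", "grass"], ["region0", "region1", "region2"],
        rng=random.Random(0),
    )
    print(task, w, r, idx)
```

Joint random masking, by contrast, would draw masks for both lists in the same step, so a masked word could lose the very region that describes it.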

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training (Supplementary Material)

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Multimodal Pretraining from Monolingual to Multilingual

RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training

Towards Zero-shot Cross-lingual Image Retrieval and Tagging

Multilingual Translation with Extensible Multilingual Pretraining and Finetuning

UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models

CCMB: A Large-scale Chinese Cross-modal Benchmark

M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Multilingual Vision-Language Pre-training for the Remote Sensing Domain

UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks

Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining

UNITER: UNiversal Image-TExt Representation Learning

Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training