UNITER: UNiversal Image-TExt Representation Learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

DOI: https://doi.org/10.48550/arXiv.1909.11740

2020-07-18

Abstract: Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.
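
The conditional masking idea in the abstract (mask one modality while the other stays fully observed) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `conditional_mask`, the `[MASK]` placeholder, the 15% masking rate, the zero-filling of masked region features, and the "mask at least one element" fallback are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder for a masked word


def conditional_mask(words, regions, task, mask_prob=0.15, rng=None):
    """Mask exactly one modality and leave the other fully observed.

    task="mlm": mask words only; every region feature is kept intact.
    task="mrm": mask regions only; every word is kept intact.
    Returns (words, regions, masked_indices) as new lists.
    """
    if task not in ("mlm", "mrm"):
        raise ValueError("task must be 'mlm' or 'mrm'")
    rng = rng or random.Random()
    words = list(words)
    regions = [list(r) for r in regions]

    # Only the modality selected by `task` is eligible for masking.
    target_len = len(words) if task == "mlm" else len(regions)
    masked = [i for i in range(target_len) if rng.random() < mask_prob]
    if not masked:
        # Assumption: always mask at least one element so the example has a target.
        masked = [rng.randrange(target_len)]

    for i in masked:
        if task == "mlm":
            words[i] = MASK_TOKEN
        else:
            # Assumption: a masked region is represented by an all-zero feature vector.
            regions[i] = [0.0] * len(regions[i])
    return words, regions, masked


# Toy inputs: four words and three 2-dimensional region features.
toy_words = ["a", "dog", "on", "grass"]
toy_regions = [[0.2, 0.7], [0.9, 0.1], [0.4, 0.4]]
mlm_words, mlm_regions, mlm_idx = conditional_mask(
    toy_words, toy_regions, "mlm", rng=random.Random(0)
)
```

By contrast, the joint random masking that the abstract attributes to previous work would allow both a word and its corresponding region to be hidden in the same example; the sketch above rules that out by construction.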

Computer Vision and Pattern Recognition, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper aims to develop a general-purpose image-text representation model that supports a wide range of Vision-and-Language (V+L) tasks. It introduces UNITER, a model that learns a joint multimodal embedding by pre-training on four large-scale image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions). The embedding can then be applied to different downstream V+L tasks, including Visual Question Answering (VQA), Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2.

The main contributions of the paper are as follows:

1. **Proposing UNITER**: A powerful general-purpose image-text representation model suitable for multiple V+L tasks.
2. **Conditional Masking**: Using a conditional masking strategy in Masked Language Modeling (MLM) and Masked Region Modeling (MRM), that is, masking only one modality while keeping the other modality intact.
3. **Word-Region Alignment (WRA) based on Optimal Transport**: Introducing a new pre-training task that uses Optimal Transport (OT) to explicitly promote fine-grained alignment between words and image regions.
4. **Achieving new state of the art on multiple V+L benchmarks**: Achieving significant performance improvements across a wide range of V+L benchmarks, surpassing existing multimodal pre-training methods.

Through these contributions, the paper addresses two limitations of existing approaches: model architectures are diverse, and the learned representations are highly task-specific and difficult to generalize to other tasks.
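
To make the OT-based Word-Region Alignment idea more concrete, the following is a minimal sketch of computing a transport plan between word and region embeddings. It is not the paper's method as implemented: the cosine-distance cost, the uniform marginals, the entropic-regularized Sinkhorn solver (used here as a simple stand-in for whatever solver the authors use), and the names `cosine_cost`, `sinkhorn_plan`, `ot_distance`, `eps`, and `n_iters` are all assumptions for illustration.

```python
import math


def cosine_cost(words, regions):
    """Cost matrix C[i][j] = 1 - cosine similarity (assumes nonzero vectors)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    return [
        [1.0 - sum(a * b for a, b in zip(w, r)) / (norm(w) * norm(r)) for r in regions]
        for w in words
    ]


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Approximate OT plan between uniform marginals via Sinkhorn iterations.

    This is an entropic-regularized stand-in solver, chosen for brevity.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over words (assumption)
    b = [1.0 / m] * m  # uniform mass over regions (assumption)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_distance(cost, plan):
    """Total transport cost <T, C>, usable as an alignment loss."""
    return sum(c * t for c_row, t_row in zip(cost, plan) for c, t in zip(c_row, t_row))


# Toy embeddings: word i points in the same direction as region i.
toy_words = [[1.0, 0.0], [0.0, 1.0]]
toy_regions = [[2.0, 0.0], [0.0, 3.0]]
toy_cost = cosine_cost(toy_words, toy_regions)
toy_plan = sinkhorn_plan(toy_cost)
toy_loss = ot_distance(toy_cost, toy_plan)
```

In this toy case the plan concentrates its mass on the matching word-region pairs, so the transport cost is close to zero. Minimizing such a cost during pre-training is one way to read "explicitly promote fine-grained alignment", whereas ITM only scores the image-text pair as a whole.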

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

UNIT: Unifying Image and Text Recognition in One Vision Encoder

UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images

Enhancing Vision-Language Model with Unmasked Token Alignment

UNIMO: Towards Unified-Modal Understanding and Generation Via Cross-Modal Contrastive Learning

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Unified Contrastive Learning in Image-Text-Label Space

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

Unified Vision-Language Pre-Training for Image Captioning and VQA

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

Multimodal Pre-training Method for Vision-language Understanding and Generation

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

Towards More Unified In-context Visual Understanding

TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding