UNITER: UNiversal Image-TExt Representation Learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
DOI: https://doi.org/10.48550/arXiv.1909.11740
2020-07-18
Abstract: Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper aims to develop a general-purpose image-text representation model that supports a wide range of Vision-and-Language (V+L) tasks. It introduces UNITER, a model that learns a joint multimodal embedding by pre-training on four large-scale image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions). The embedding can then be applied to downstream V+L tasks, including Visual Question Answering (VQA), Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$.

The main contributions of the paper are as follows:

1. **Proposing UNITER**: A general-purpose image-text representation model suitable for multiple V+L tasks.
2. **Conditional Masking**: Masked Language Modeling (MLM) and Masked Region Modeling (MRM) mask only one modality at a time and keep the other modality intact, instead of randomly masking both modalities jointly.
3. **Word-Region Alignment (WRA) based on Optimal Transport**: A new pre-training task that uses Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions.
4. **Achieving new state of the art on multiple V+L benchmarks**: UNITER sets new state-of-the-art results across six V+L tasks (over nine datasets), surpassing existing multimodal pre-training methods.

Through these contributions, the paper addresses two limitations of existing work: model architectures are diverse, and the learned representations are highly task-specific and difficult to generalize to other tasks.
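Contributions 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `[MASK]` token, the 15% masking rate, the zeroing of masked region features, the cosine cost, the uniform marginals, and the use of Sinkhorn iterations as the OT solver are all assumptions made for illustration; the text above does not specify which OT solver the paper uses.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token for this sketch


def conditional_mask(words, regions, mask_prob=0.15, rng=None):
    """Mask one modality only; the other stays fully observed."""
    rng = rng or random.Random()
    words = list(words)
    regions = [list(r) for r in regions]
    # Pick which modality to corrupt for this training example.
    target = "text" if rng.random() < 0.5 else "image"
    size = len(words) if target == "text" else len(regions)
    picked = [i for i in range(size) if rng.random() < mask_prob]
    if not picked:  # always mask at least one position
        picked = [rng.randrange(size)]
    for i in picked:
        if target == "text":
            words[i] = MASK
        else:
            regions[i] = [0.0] * len(regions[i])  # assumed: zero the features
    return words, regions, target


def cosine_cost(u, v):
    """Transport cost between two embeddings: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-12)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan with uniform marginals (Sinkhorn)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def wra_distance(word_vecs, region_vecs):
    """Total transport cost between words and regions (lower = better aligned)."""
    cost = [[cosine_cost(w, r) for r in region_vecs] for w in word_vecs]
    plan = sinkhorn_plan(cost)
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))
```

In this sketch, `conditional_mask` returns which modality was corrupted so that a training loop could apply the matching MLM or MRM objective, and `wra_distance` returns a scalar that could be added to a pre-training loss. The transport plan itself can be read as a soft word-to-region assignment.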