Abstract: Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks through fine-tuning. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relationships between visual contents play an important role in image understanding and are crucial for inter-modal alignment learning in VLP. However, CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies. As a result, the two objectives of learning visual relations and learning inter-modal alignment are encapsulated in the same Transformer network. Such a design might restrict inter-modal alignment learning in the Transformer because it neglects the specialized characteristics of each objective. To tackle this challenge, we propose a fully Transformer-based visual embedding for VLP to better learn visual relations and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between vision and language (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in the Transformer to further promote inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformers for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Image-Text Retrieval, Visual Question Answering (VQA), Visual Entailment, and Visual Reasoning. The results show that our approach not only outperforms state-of-the-art VLP models but also exhibits superiority on the IMF metric.
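
The abstract does not state how IMF is computed. Purely as an illustration of what "measuring the interaction between vision and language" could look like, the sketch below computes the share of attention mass that crosses the modality boundary in a single joint self-attention matrix. The function name `cross_modal_attention_share`, the token layout (text tokens first, visual tokens after), and the definition itself are assumptions made for this example, not the paper's formula.

```python
def cross_modal_attention_share(attn, num_text_tokens):
    """Fraction of attention mass that flows between modalities.

    Hypothetical illustration, NOT the paper's IMF definition.
    attn: square attention matrix as a list of rows, where attn[i][j] is
          the weight that query token i places on key token j.
    num_text_tokens: tokens 0..num_text_tokens-1 are assumed to be text;
          all remaining tokens are assumed to be visual.
    """
    cross = 0.0
    total = 0.0
    for i, row in enumerate(attn):
        query_is_text = i < num_text_tokens
        for j, weight in enumerate(row):
            key_is_text = j < num_text_tokens
            total += weight
            # Count the weight only when query and key differ in modality.
            if query_is_text != key_is_text:
                cross += weight
    return cross / total if total else 0.0


# Toy example: 2 text tokens followed by 2 visual tokens.
attn = [
    [0.50, 0.50, 0.00, 0.00],  # text token attends only to text
    [0.25, 0.25, 0.25, 0.25],  # text token attends evenly to all tokens
    [0.00, 0.00, 0.50, 0.50],  # visual token attends only to visual
    [0.50, 0.00, 0.00, 0.50],  # visual token splits between text and visual
]
share = cross_modal_attention_share(attn, num_text_tokens=2)
print(share)
```

Under this assumed definition, a value near 0 means the two modalities mostly attend within themselves, and a larger value means more attention crosses between text and image tokens.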

Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking

Probing Representations Learned by Multimodal Recurrent and Transformer Models

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Attention-Guided Contrastive Masked Image Modeling for Transformer-Based Self-Supervised Learning

MVP: Multimodality-Guided Visual Pre-training

Ibot: Image BERT Pre-Training with Online Tokenizer

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation

UNITER: UNiversal Image-TExt Representation Learning

Probing the Role of Positional Information in Vision-Language Models

Interactive Image Segmentation with Cross-Modality Vision Transformers

Vman: visual-modified attention network for multimodal paradigms

MAMO: Fine-Grained Vision-Language Representations Learning with Masked Multimodal Modeling

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case

Probing Cross-modal Semantics Alignment Capability from the Textual Perspective