Abstract: Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks through fine-tuning. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and texts with a Transformer. Relationships among visual contents play an important role in image understanding and are crucial for learning inter-modal alignment in VLP. However, CNNs are limited in visual relation learning because their local receptive fields are weak at modeling long-range dependencies. As a result, the two objectives of learning visual relations and learning inter-modal alignment are both encapsulated in the same Transformer network. Such a design might restrict inter-modal alignment learning in the Transformer by neglecting the specialized characteristics of each objective. To tackle this challenge, we propose a fully Transformer-based visual embedding for VLP to better learn visual relations and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between vision and language (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in the Transformer to further promote inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of the Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Image-Text Retrieval, Visual Question Answering (VQA), Visual Entailment and Visual Reasoning. The results show that our approach not only outperforms the state-of-the-art VLP models, but also exhibits superiority on the IMF metric.

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Unified Vision-Language Pre-Training for Image Captioning and VQA

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

CogVLM: Visual Expert for Pretrained Language Models

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Unified Video-Language Pre-training with Synchronized Audio

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks