Abstract: Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images. In order to improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and pre-trained models are available at https://github.com/salesforce/ALBEF/.

What problem does this paper attempt to address?

This paper addresses three key limitations of existing vision-and-language pre-training methods:

1. **Feature Alignment Problem**: Image features and word embeddings lie in different spaces, which makes it challenging for the multimodal encoder to learn the interactions between them.
2. **Dependence on Object Detectors**: Most existing methods rely on pre-trained object detectors to extract region-based image features. This requires bounding box annotations for training the detector and high-resolution images at inference time, which increases the computational cost.
3. **Impact of Noisy Data**: Widely used image-text datasets (such as data collected from the web) are inherently noisy, and existing pre-training objectives (such as masked language modeling) may overfit to the noisy text, reducing the model's generalization performance.

To address these challenges, the authors propose the **ALign BEfore Fuse (ALBEF)** model. Its main contributions are as follows:

- **Introduction of Contrastive Loss**: An image-text contrastive (ITC) loss is applied to the representations produced by the unimodal encoders, so that image and text representations are aligned before they are fused. This makes cross-modal learning easier for the multimodal encoder.
- **Momentum Distillation**: The authors propose momentum distillation (MoD), in which a momentum model generates pseudo-targets that serve as an additional supervision signal, improving learning from noisy data.
- **Theoretical Analysis**: ALBEF is analyzed from the perspective of mutual information maximization, showing that the different training tasks can be interpreted as different ways of generating views of an image-text pair.
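The image-text contrastive alignment described above can be illustrated with a minimal in-batch sketch. It is not the authors' implementation: the function name `itc_loss`, the use of NumPy, the fixed `temperature=0.07`, and the restriction to in-batch negatives are all assumptions made for illustration. The idea shown is only that matching image-text pairs should score higher than mismatched ones in both directions.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over one batch (illustrative).

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is
    assumed to come from the same image-text pair.
    temperature: assumed fixed value, chosen only for this sketch.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    sim = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(sim.shape[0])    # the positive pair sits on the diagonal

    # Cross-entropy against the diagonal, in both directions.
    loss_i2t = -log_softmax(sim, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(sim.T, axis=1)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch, a batch whose image and text embeddings are already aligned yields a much lower loss than the same batch with the text rows shuffled, which is the signal that pushes the two unimodal encoders toward a shared space before fusion.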
With these improvements, ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks, most notably image-text retrieval, visual question answering (VQA), and natural language visual reasoning (NLVR$^2$). Specifically, ALBEF achieves absolute improvements of 2.37% on VQA and 3.84% on NLVR$^2$ over the previous state of the art, while also offering faster inference.
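The momentum distillation idea summarized above (learning from pseudo-targets produced by a momentum model) can be sketched as two small steps: keep a slowly moving average of the model's parameters, and blend that momentum model's soft predictions with the one-hot labels. The sketch below is a simplified illustration, not the authors' code. The helper names `ema_update`, `distilled_targets`, and `soft_cross_entropy`, the dictionary-of-arrays parameter format, and the values `m=0.995` and `alpha=0.4` are assumptions made for this example.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def softmax(x, axis=-1):
    return np.exp(log_softmax(x, axis=axis))


def ema_update(momentum_params, params, m=0.995):
    """Exponential moving average of parameters (hypothetical dict format).

    The momentum model is never trained directly; it only tracks the
    online model. The coefficient m is an assumed illustrative value.
    """
    return {k: m * momentum_params[k] + (1.0 - m) * params[k] for k in params}


def distilled_targets(one_hot, momentum_logits, alpha=0.4):
    """Blend noisy one-hot labels with the momentum model's soft predictions.

    alpha is an assumed mixing weight: 0 ignores the pseudo-targets,
    1 trusts them completely.
    """
    return (1.0 - alpha) * one_hot + alpha * softmax(momentum_logits, axis=1)


def soft_cross_entropy(logits, targets):
    # Cross-entropy of the online model's predictions against soft targets.
    return -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
```

A training step under these assumptions would compute `momentum_logits` with the momentum model, build soft targets with `distilled_targets`, minimize `soft_cross_entropy` for the online model, and then call `ema_update`. The soft targets let the model receive credit for plausible alternatives that a noisy web caption does not mention.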

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition

VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Contrastive Vision-Language Alignment Makes Efficient Instruction Learner

Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Multi-level textual-visual alignment and fusion network for multimodal aspect-based sentiment analysis

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Explicit Alignment Objectives for Multilingual Bidirectional Encoders

DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning

Vision-Language Pre-Training with Triple Contrastive Learning