Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi
DOI: https://doi.org/10.48550/arXiv.2107.07651
2021-10-07
Abstract: Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images. In order to improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and pre-trained models are available at https://github.com/salesforce/ALBEF/.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper targets three limitations of existing vision-and-language pre-training methods:

1. **Feature Alignment Problem**: Image features and word token embeddings reside in separate spaces, which makes it hard for the multimodal encoder to learn the interactions between them.
2. **Dependence on Object Detectors**: Most existing methods rely on pre-trained object detectors to extract region-based image features. This requires bounding box annotations for pre-training and high-resolution images at inference time, which increases both annotation and computational cost.
3. **Impact of Noisy Data**: Widely used image-text datasets collected from the web are inherently noisy, and existing pre-training objectives such as masked language modeling may overfit to the noisy text, which reduces the model's generalization performance.

To address these challenges, the authors propose the **ALign BEfore Fuse (ALBEF)** model. Its main contributions are:

- **Image-Text Contrastive Loss**: An image-text contrastive (ITC) loss is applied to the representations produced by the unimodal encoders, so that image and text representations are aligned before they are fused. This makes cross-modal learning easier for the multimodal encoder.
- **Momentum Distillation**: Momentum distillation (MoD) uses a momentum model to generate pseudo-targets that serve as an additional supervision signal, which improves learning from noisy data.
- **Theoretical Analysis**: ALBEF is analyzed from the perspective of mutual information maximization, showing that the different training tasks can be interpreted as different ways of generating views of an image-text pair.
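To make the image-text contrastive idea concrete, here is a minimal NumPy sketch of a symmetric in-batch contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the function name `itc_loss`, the fixed temperature of 0.07, and the use of in-batch negatives only are choices made for this example, and the embeddings are assumed to be already produced by the unimodal encoders.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch (illustrative).

    image_emb, text_emb: arrays of shape (batch, dim); row i of each array
    is assumed to come from the same image-text pair.
    temperature: assumed fixed value for this sketch.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # sim[i, j] = similarity between image i and text j, scaled by temperature.
    sim = img @ txt.T / temperature
    idx = np.arange(sim.shape[0])

    # Matched pairs sit on the diagonal; every other entry is a negative.
    loss_i2t = -log_softmax(sim, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(sim.T, axis=1)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("aligned pairs:   ", itc_loss(emb, emb))
    print("misaligned pairs:", itc_loss(emb, np.roll(emb, 1, axis=0)))
```

The loss is low when each image embedding is closest to its own text embedding and high otherwise, which is the sense in which the unimodal representations are aligned before the multimodal encoder fuses them.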
With these contributions, ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks, notably image-text retrieval, visual question answering (VQA), and Natural Language for Visual Reasoning (NLVR$^2$). On VQA and NLVR$^2$, ALBEF improves on the previous state of the art by 2.37% and 3.84% absolute, respectively, while also offering faster inference.
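The momentum distillation idea mentioned above can be sketched in a few lines. The summary does not give the exact formulation, so the following is a hedged illustration: the momentum model is kept as an exponential moving average of the trained model's parameters, and its softmax output is mixed with the one-hot labels to form soft pseudo-targets. The names `ema_update` and `distilled_cross_entropy`, the momentum coefficient 0.995, and the mixing weight `alpha=0.4` are assumptions for this example.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def ema_update(momentum_params, online_params, momentum=0.995):
    """Return momentum-model parameters updated as an exponential moving
    average of the online (trained) parameters. Both are dicts of arrays."""
    return {
        name: momentum * momentum_params[name]
        + (1.0 - momentum) * online_params[name]
        for name in momentum_params
    }


def distilled_cross_entropy(logits, momentum_logits, alpha=0.4):
    """Cross-entropy against a mix of hard labels and momentum pseudo-targets.

    logits, momentum_logits: (batch, batch) similarity matrices from the
    online and momentum models; the correct match is assumed to be on the
    diagonal. alpha is the (assumed) weight given to the pseudo-targets.
    """
    batch = logits.shape[0]
    hard_targets = np.eye(batch)
    # Pseudo-targets: the momentum model's predicted distribution per row.
    soft_targets = np.exp(log_softmax(momentum_logits, axis=1))
    targets = alpha * soft_targets + (1.0 - alpha) * hard_targets
    return -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    online = {"w": rng.normal(size=(4, 4))}
    momentum_model = {"w": np.zeros((4, 4))}
    momentum_model = ema_update(momentum_model, online)
    logits = rng.normal(size=(4, 4))
    print(distilled_cross_entropy(logits, logits + 0.1 * rng.normal(size=(4, 4))))
```

With `alpha=0` the loss reduces to ordinary cross-entropy on the one-hot labels; larger values let the momentum model's predictions soften the targets, which is the intuition for why the approach is less sensitive to noisy web captions.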