Efficient Vision-Language Pre-training by Cluster Masking

Zihao Wei, Zixuan Pan, Andrew Owens

2024-05-15

Abstract: We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our model by pre-training on a number of benchmarks, finding that it outperforms other masking strategies, such as FLIP, on the quality of the learned representation.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper proposes a vision-language pre-training strategy called "Cluster Masking". Earlier masking approaches to contrastive learning, such as FLIP, mask randomly chosen image patches to improve efficiency and learning quality. However, masking patches independently at random may lose semantically related information. To address this, the paper randomly masks whole clusters of image patches with similar pixel intensities during training. This both speeds up training and improves the quality of the learned representations, because the model must predict the masked content from context.

The main contributions of the paper are as follows:

1. Using raw RGB values as features, the clusters of image patches to be masked are determined through clustering. This relies on simple visual similarity to capture visual structures such as object parts.
2. Masking entire clusters instead of individual patches requires the model to predict descriptive words for missing scene structures from context, thereby enhancing representation learning.
3. Compared to methods that only mask image patches at random, this approach improves representation accuracy while maintaining training efficiency.

Experimental results show that the method outperforms other masking strategies on multiple downstream tasks, including zero-shot classification, linear probing, text and image retrieval, and language composition benchmarks. The study also found that clustering on features learned by the model, rather than on raw pixels, can further improve performance.
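The masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `patchify` and `cluster_mask`, the use of cosine similarity between flattened RGB patches, and the `anchor_ratio` and `sim_threshold` parameters (with their default values) are all assumptions made for illustration. The idea shown is to pick a few random anchor patches and mask every patch that looks similar to an anchor.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (N, patch * patch * C), in row-major grid order.
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def cluster_mask(image, patch=16, anchor_ratio=0.05, sim_threshold=0.75, rng=None):
    """Return a boolean array of length N where True marks a masked patch.

    Hypothetical procedure: sample random anchor patches, then mask each
    anchor together with all patches whose raw-pixel cosine similarity to
    that anchor reaches sim_threshold.
    """
    rng = np.random.default_rng() if rng is None else rng
    feats = patchify(image.astype(np.float64), patch)
    # Unit-normalize the raw pixel features so dot products are cosine similarities.
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    n = feats.shape[0]
    n_anchors = max(1, int(round(anchor_ratio * n)))
    anchors = rng.choice(n, size=n_anchors, replace=False)
    sim = feats[anchors] @ feats.T            # shape (n_anchors, N)
    mask = (sim >= sim_threshold).any(axis=0)
    mask[anchors] = True                      # anchors are always masked
    return mask

if __name__ == "__main__":
    # Toy image: top half pure red, bottom half pure blue, 2 x 2 grid of patches.
    img = np.zeros((32, 32, 3))
    img[:16, :, 0] = 1.0
    img[16:, :, 2] = 1.0
    mask = cluster_mask(img, patch=16, rng=np.random.default_rng(0))
    print("masked patches:", np.flatnonzero(mask))
```

In a training loop, only the patches where the mask is False would be passed to the image encoder, which is where the speed-up would come from. Because the number of masked patches varies from image to image under this scheme, a batched implementation would also need some way to equalize the number of kept patches, which this sketch leaves out.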

Masked Image Contrastive Learning for Efficient Visual Conceptual Pre-training

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

Scaling Language-Image Pre-training via Masking

Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining

Centered Masking for Language-Image Pre-Training

Uniform Masking Prevails in Vision-Language Pretraining

Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Leveraging per Image-Token Consistency for Vision-Language Pre-training

Toward High Quality Facial Representation Learning

SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Enhancing Vision-Language Model with Unmasked Token Alignment

Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations

Masked Channel Modeling for Bootstrapping Visual Pre-training

Train No Evil: Selective Masking for Task-Guided Pre-Training

Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

Attentive Mask CLIP

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining