Efficient Vision-Language Pre-training by Cluster Masking

Zihao Wei, Zixuan Pan, Andrew Owens
2024-05-15
Abstract: We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our model by pre-training on a number of benchmarks, finding that it outperforms other masking strategies, such as FLIP, on the quality of the learned representation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper proposes a vision-language pre-training strategy called "Cluster Masking". Existing masking strategies for contrastive learning, such as FLIP, mask random image patches to improve efficiency and learning quality. However, masking patches at random may lose semantically related information. To address this, the paper randomly masks clusters of visually similar image patches, as measured by their raw pixel intensities, at each training iteration. This speeds up training, since less of each image is processed, and it improves the learned representations, since the model must predict words for masked visual structures from context alone.

The main contributions of the paper are as follows:
1. Using the raw RGB values as features, the image patch clusters to be masked are determined through clustering. This relies on simple visual similarity to capture visual structures such as object parts.
2. Masking entire clusters instead of individual patches requires the model to predict descriptive words for missing scene structures from context, which strengthens representation learning.
3. Compared to methods that mask image patches purely at random, this approach improves the quality of the learned representations while maintaining training efficiency.

Experimental results show that the method outperforms other masking strategies on multiple downstream tasks, including zero-shot classification, linear probing, text and image retrieval, and language composition benchmarks. The study also found that clustering on features learned by the model, rather than on raw pixels, can further improve performance.
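
The masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary only states that clusters of visually similar patches are masked based on raw pixel values, so the specific procedure here (pick a few random anchor patches, then mask every patch whose cosine similarity to an anchor exceeds a threshold) is an assumption, as are the names `patchify`, `cluster_mask`, `num_anchors`, and `sim_threshold`.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Crop to a multiple of patch_size, then group pixels by patch.
    grid = image[:gh * patch_size, :gw * patch_size]
    grid = grid.reshape(gh, patch_size, gw, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch_size * patch_size * c).astype(np.float64)

def cluster_mask(image, patch_size=16, num_anchors=3, sim_threshold=0.9, rng=None):
    """Return a boolean array over patches; True means the patch is masked.

    Assumed procedure (hypothetical, for illustration only):
    1. Use raw pixel values of each patch as its feature vector.
    2. Sample `num_anchors` random anchor patches.
    3. Mask each anchor plus all patches similar enough to any anchor.
    """
    rng = np.random.default_rng() if rng is None else rng
    patches = patchify(image, patch_size)

    # Unit-normalize so that a dot product equals cosine similarity.
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    feats = patches / np.maximum(norms, 1e-8)

    anchors = rng.choice(len(feats), size=num_anchors, replace=False)
    sim = feats @ feats[anchors].T              # shape: (num_patches, num_anchors)
    masked = (sim >= sim_threshold).any(axis=1)
    masked[anchors] = True                      # anchors are always masked
    return masked

# Usage: keep only the visible patches for the image encoder.
image = np.random.default_rng(0).random((224, 224, 3))
mask = cluster_mask(image, patch_size=16, num_anchors=5, sim_threshold=0.95)
visible = patchify(image, 16)[~mask]
```

Because the number of masked patches depends on image content, the masking ratio varies from image to image in this sketch; a training pipeline would need some way to handle that (for example, padding or truncating the visible set), which is left out here.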