Masked Channel Modeling for Bootstrapping Visual Pre-training
Yang Liu, Xinlong Wang, Muzhi Zhu, Yue Cao, Tiejun Huang, Chunhua Shen
DOI: https://doi.org/10.1007/s11263-024-02204-6
IF: 13.369
2024-08-19
International Journal of Computer Vision
Abstract: Large vision models have recently achieved great success in computer vision, e.g., CLIP for large-scale image-text contrastive learning. They show strong potential in representation learning and transfer well to various downstream tasks. However, directly training a larger CLIP model from scratch is difficult because of the enormous training cost, unstable training, and the difficulty of collecting a large amount of training data. In this work, we aim to scale up CLIP models and extend their capabilities with self-supervised representation learning. We introduce masked channel modeling (MCM), a new self-supervised learning framework that randomly masks the input feature maps extracted by a CLIP model and reconstructs the missing features. Unlike masked image modeling (MIM), which takes raw pixels as the input and output, MCM performs masked modeling in a high-dimensional semantic space by masking random channels of the visual features and reconstructing the corrupted channels. We show that channel maps are a good fit for masked modeling, as the visual features are semantically structured across channels. We demonstrate that our method can scale up the CLIP model at a low training cost and extend its capabilities in zero-shot learning, few-shot learning, and end-to-end fine-tuning. Based on CLIP ViT-L, MCM improves zero-shot image classification accuracy by 0.5% averaged over 8 benchmarks. With a few samples, e.g., 1-shot or 2-shot, MCM achieves significant improvements when adapting to 11 image classification benchmarks. In addition, MCM shows strong performance when fine-tuned end-to-end on different downstream tasks, e.g., improving CLIP ViT-B by 0.9% top-1 accuracy on ImageNet-1K classification and 2.5% mIoU on ADE20K semantic segmentation.
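The channel-masking idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature shape (196 tokens by 768 channels), the 0.5 mask ratio, the zero fill for masked channels, the mean-squared-error loss, and the helper names `mask_channels` and `masked_reconstruction_loss` are all assumptions made for illustration, and random values stand in for real CLIP features.

```python
import numpy as np


def mask_channels(features, mask_ratio=0.5, rng=None):
    """Zero out a random subset of channels in a (tokens, channels) array.

    Assumption: masked channels are filled with zeros; the paper may use
    a different corruption scheme.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_channels = features.shape[-1]
    num_masked = int(round(mask_ratio * num_channels))
    # Pick distinct channel indices to mask.
    masked_idx = rng.choice(num_channels, size=num_masked, replace=False)
    channel_mask = np.zeros(num_channels, dtype=bool)
    channel_mask[masked_idx] = True
    corrupted = features.copy()
    corrupted[..., channel_mask] = 0.0
    return corrupted, channel_mask


def masked_reconstruction_loss(prediction, target, channel_mask):
    """Mean squared error computed over the masked channels only (assumed loss)."""
    diff = prediction[..., channel_mask] - target[..., channel_mask]
    return float(np.mean(diff ** 2))


# Stand-in for visual features from a frozen encoder (hypothetical shape).
rng = np.random.default_rng(0)
features = rng.standard_normal((196, 768))

corrupted, channel_mask = mask_channels(features, mask_ratio=0.5, rng=rng)

# A model would map `corrupted` to a prediction of `features`; here the
# corrupted input itself serves as a trivial baseline prediction.
loss_baseline = masked_reconstruction_loss(corrupted, features, channel_mask)
loss_perfect = masked_reconstruction_loss(features, features, channel_mask)
```

In this sketch the loss is restricted to the masked channels, so a predictor gains nothing from copying the visible channels and has to infer the missing ones from the rest.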
computer science, artificial intelligence