Abstract: Large vision models such as CLIP, which is trained with large-scale image-text contrastive learning, have recently achieved great success in computer vision. They show strong potential for representation learning and transfer well to a variety of downstream tasks. However, training a larger CLIP model from scratch is difficult because of the enormous training cost, training instability, and the difficulty of collecting enough training data. In this work, we aim to scale up CLIP models and extend their capabilities with self-supervised representation learning. We introduce masked channel modeling (MCM), a new self-supervised learning framework that randomly masks the input feature maps extracted by a CLIP model and reconstructs the missing features. Unlike masked image modeling (MIM), which takes raw pixels as input and output, MCM performs masked modeling in a high-dimensional semantic space by masking random channels of the visual features and reconstructing the corrupted channels. We show that channel maps are a good fit for masked modeling because the visual features are semantically structured across channels. We demonstrate that our method can scale up the CLIP model at a low training cost and extend its capabilities in zero-shot learning, few-shot learning, and end-to-end fine-tuning. Based on CLIP ViT-L, MCM improves zero-shot image classification accuracy by 0.5% averaged over 8 benchmarks. With only a few samples, e.g., 1-shot or 2-shot, MCM achieves significant improvements when adapting to 11 image classification benchmarks. In addition, MCM performs strongly when fine-tuned end-to-end on different downstream tasks, e.g., improving CLIP ViT-B by 0.9% top-1 accuracy on ImageNet-1K classification and 2.5% mIoU on ADE20K semantic segmentation.
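The channel-masking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the mask ratio, how masked channels are filled, whether the mask is shared across tokens, or which loss is used. The sketch below assumes a 50% mask ratio, zero-filling, one mask shared by all tokens, and a mean-squared-error loss computed only on masked channels. The function names `mask_channels` and `masked_channel_loss` are hypothetical, and random values stand in for real CLIP features.

```python
import random


def mask_channels(features, mask_ratio, rng):
    """Zero out a random subset of channels in a (tokens x channels) feature map.

    Assumption: the same channels are masked for every token, and masked
    entries are replaced by 0.0. The source does not state either choice.
    """
    num_channels = len(features[0])
    num_masked = int(round(mask_ratio * num_channels))
    masked_idx = sorted(rng.sample(range(num_channels), num_masked))
    masked = set(masked_idx)
    corrupted = [
        [0.0 if c in masked else value for c, value in enumerate(token)]
        for token in features
    ]
    return corrupted, masked_idx


def masked_channel_loss(pred, target, masked_idx):
    """Mean squared error over masked channels only (an assumed loss)."""
    if not masked_idx:
        return 0.0
    total = 0.0
    count = 0
    for pred_token, target_token in zip(pred, target):
        for c in masked_idx:
            total += (pred_token[c] - target_token[c]) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# Stand-in for frozen CLIP features: 4 tokens with 8 channels each.
features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
corrupted, masked_idx = mask_channels(features, mask_ratio=0.5, rng=rng)

# A model that simply returns its corrupted input pays a positive loss,
# while a perfect reconstruction pays none.
loss_trivial = masked_channel_loss(corrupted, features, masked_idx)
loss_perfect = masked_channel_loss(features, features, masked_idx)
```

In a real setup the corrupted features would be passed through a trainable network, and its output would take the place of `corrupted` in the loss; restricting the loss to masked channels mirrors the common practice in masked image modeling of scoring only the masked positions.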

MIMIC: Masked Image Modeling with Image Correspondences

MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning

Revealing the Dark Secrets of Masked Image Modeling

Delving Deeper into Data Scaling in Masked Image Modeling

SG-MIM: Structured Knowledge Guided Efficient Pre-training for Dense Prediction

GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

Masked Image Modeling with Local Multi-Scale Reconstruction

Scaling Efficient Masked Image Modeling on Large Remote Sensing Dataset

On Data Scaling in Masked Image Modeling

PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

Adept: Annotation-denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

Masked Channel Modeling for Bootstrapping Visual Pre-training

Masked Image Modeling Advances 3D Medical Image Analysis

SimMIM: A Simple Framework for Masked Image Modeling

MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning

OpticalRS-4M: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset

Masked Image Modeling: A Survey

BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling