Abstract: Masked image modeling (MIM) as pre-training has been shown to be effective for numerous downstream vision tasks, but how and where MIM works remains unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers and more globally at higher layers. This may be why MIM helps optimize Vision Transformers, which have a very large receptive field. With MIM, the model maintains a large diversity across attention heads in all layers. For supervised models, in contrast, the diversity across attention heads almost disappears in the last three layers, and this reduced diversity harms fine-tuning performance. From the experiments, we find that MIM models perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics and on fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L achieves state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). On semantic understanding datasets whose categories are sufficiently covered by the supervised pre-training, MIM models still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction. Code will be available at https://github.com/zdaxie/MIM-DarkSecrets.
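
To make the notion of "locality" in the abstract concrete, the sketch below computes one plausible measure: the attention-weighted mean spatial distance between a query patch and the patches it attends to. This is an illustrative assumption, not necessarily the exact metric used in the paper; the function name `average_attention_distance`, the row-major patch layout, and the toy 2x2 grid are all hypothetical.

```python
import math

def average_attention_distance(attn, grid_w, patch_size=1.0):
    """Attention-weighted mean distance between query and key patches.

    attn: N x N nested list; row q holds the attention weights of query
          patch q over all N key patches (each row is assumed to sum to 1).
    grid_w: number of patches per row (patches assumed laid out row-major).
    patch_size: side length of one patch, used to scale grid distances.
    """
    n = len(attn)
    total = 0.0
    for q in range(n):
        qy, qx = divmod(q, grid_w)
        for k in range(n):
            ky, kx = divmod(k, grid_w)
            dist = math.hypot(qx - kx, qy - ky) * patch_size
            total += attn[q][k] * dist
    return total / n  # average over query patches

# Toy 2x2 patch grid (4 patches), purely illustrative.
local_attn = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
global_attn = [[0.25] * 4 for _ in range(4)]

local_dist = average_attention_distance(local_attn, grid_w=2)
global_dist = average_attention_distance(global_attn, grid_w=2)
print(local_dist, global_dist)
```

Under this measure, a head that attends only to its own patch scores zero, while a head that attends uniformly over the grid scores higher; a smaller value therefore indicates more local attention.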

SemDM: Task-oriented masking strategy for self-supervised visual learning

Disjoint Masking with Joint Distillation for Efficient Masked Image Modeling

When Masked Image Modeling Meets Source-free Unsupervised Domain Adaptation: Dual-Level Masked Network for Semantic Segmentation

Delving Deeper into Mask Utilization in Video Object Segmentation

SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

M^3CS: Multi-Target Masked Point Modeling with Learnable Codebook and Siamese Decoders

Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation

SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation

Learning with Unmasked Tokens Drives Stronger Vision Learners

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

SimMIM: A Simple Framework for Masked Image Modeling

DPPMask: Masked Image Modeling with Determinantal Point Processes

MixMask: Revisiting Masking Strategy for Siamese ConvNets

Symmetric masking strategy enhances the performance of Masked Image Modeling

Masked Image Modeling with Denoising Contrast

BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling

Revealing the Dark Secrets of Masked Image Modeling

Exploring the Coordination of Frequency and Attention in Masked Image Modeling

Efficient Masked Autoencoders with Self-Consistency

Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning

Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining