Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN

Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Stan Z. Li

2023-06-02

Abstract: Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers. Its underlying idea is simple: a portion of the input image is masked out and then reconstructed via a pre-text task. However, the working principle behind MIM is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among patches for more generalized feature extraction. We then propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A$^2$MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks.
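The mask-then-reconstruct idea described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the paper's implementation: the function names `random_patch_mask` and `masked_l1_loss`, the zero fill value, the patch size, the mask ratio, and the choice of an L1 loss restricted to masked pixels are all assumptions made for the example.

```python
import numpy as np


def random_patch_mask(height, width, patch_size, mask_ratio, rng):
    """Return a boolean (height, width) array; True marks masked pixels."""
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Pick which patches to hide, without replacement.
    masked_ids = rng.choice(num_patches, size=num_masked, replace=False)
    patch_mask = np.zeros(num_patches, dtype=bool)
    patch_mask[masked_ids] = True
    patch_mask = patch_mask.reshape(grid_h, grid_w)
    # Expand each patch-level flag to a patch_size x patch_size pixel block.
    return np.repeat(np.repeat(patch_mask, patch_size, axis=0), patch_size, axis=1)


def masked_l1_loss(prediction, target, pixel_mask):
    """Mean absolute error over masked pixels only (an assumed loss choice)."""
    diff = np.abs(prediction - target)  # shape (H, W, C)
    return diff[pixel_mask].mean()


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real image
mask = random_patch_mask(32, 32, patch_size=8, mask_ratio=0.5, rng=rng)
# Hide the masked patches; zero fill is an assumption for illustration.
masked_image = np.where(mask[..., None], 0.0, image)
# A real model would predict the hidden pixels; here the masked input
# itself stands in as a trivial "prediction" to show the loss call.
loss = masked_l1_loss(masked_image, image, mask)
```

In an actual pre-training loop, `masked_image` would be fed to the backbone and a decoder head, and the loss would be computed on the model's output rather than on the masked input.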

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper asks how self-supervised pre-training with Masked Image Modeling (MIM) can improve a model's generalization and downstream task performance while remaining applicable to both Transformer architectures and Convolutional Neural Networks (CNNs). It focuses on three points:

1. **The Essence of MIM**: Existing research suggests that MIM is mainly suitable for Transformer architectures but less so for CNNs. Through systematic experiments, the paper finds that the core of MIM lies in teaching the model to learn better middle-order interactions (i.e., intermediate-level interactions between image patches) rather than simply improving reconstruction quality. These middle-order interactions help in extracting more generalizable features.
2. **Architecture-Agnostic MIM Framework**: Based on this finding, the paper proposes an architecture-agnostic MIM framework (A$^2$MIM) that can be applied to both Transformers and CNNs without architecture-specific designs. A$^2$MIM strengthens the model's middle-order interactions by performing the masking operation on intermediate feature maps and by introducing a frequency-domain loss.
3. **Experimental Validation**: The paper evaluates A$^2$MIM through extensive experiments on multiple popular benchmark datasets. The results show that A$^2$MIM learns better representations and improves the model's transferability across various downstream tasks.

In summary, the paper aims to close the gap in MIM between network architectures and to improve the method so that CNNs can also benefit from this self-supervised pre-training approach.
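To make the idea of a frequency-domain loss more concrete, here is a minimal sketch that compares the 2-D Fourier spectra of a reconstruction and its target and adds the result to a spatial loss. It is not the paper's exact formulation: the function names `frequency_l1_loss` and `combined_loss`, the orthonormal FFT scaling, the unweighted spectrum difference, and the `freq_weight` value are assumptions chosen for illustration.

```python
import numpy as np


def frequency_l1_loss(prediction, target):
    """Mean magnitude of the difference between 2-D Fourier spectra."""
    # FFT over the spatial axes (0, 1); each channel is transformed independently.
    pred_spec = np.fft.fft2(prediction, axes=(0, 1), norm="ortho")
    target_spec = np.fft.fft2(target, axes=(0, 1), norm="ortho")
    return np.abs(pred_spec - target_spec).mean()


def combined_loss(prediction, target, freq_weight=0.5):
    """Spatial L1 loss plus a weighted frequency-domain term (assumed weighting)."""
    spatial = np.abs(prediction - target).mean()
    return spatial + freq_weight * frequency_l1_loss(prediction, target)


rng = np.random.default_rng(0)
target = rng.random((16, 16, 3))  # stand-in for the original image
# A noisy copy of the target stands in for a model's reconstruction.
prediction = target + 0.1 * rng.standard_normal(target.shape)
total = combined_loss(prediction, target)
```

Because the Fourier transform mixes information from every spatial location, a term of this kind penalizes errors in global structure that a purely per-pixel loss treats independently.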

