Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN

Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Stan Z. Li
2023-06-02
Abstract: Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers. Its underlying idea is simple: a portion of the input image is masked out and then reconstructed via a pre-text task. However, the working principle behind MIM is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among patches for more generalized feature extraction. We then propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A$^2$MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how self-supervised pre-training with Masked Image Modeling (MIM) can improve a model's generalization and downstream performance in computer vision while remaining applicable to Convolutional Neural Networks (CNNs) as well as Transformers. It focuses on three points:

1. **The Essence of MIM**: Existing research suggests that MIM is mainly suitable for Transformer architectures but less so for CNNs. Through systematic experiments, the paper argues that the core of MIM lies in teaching the model to learn better middle-order interactions among image patches, rather than simply improving reconstruction quality. These middle-order interactions help the model extract more generalizable features.
2. **Architecture-Agnostic MIM Framework**: Based on this finding, the paper proposes an architecture-agnostic MIM framework (A$^2$MIM) that can be applied to both Transformers and CNNs without architecture-specific designs. A$^2$MIM strengthens the model's middle-order interactions by masking intermediate feature maps and introducing a frequency-domain loss.
3. **Experimental Validation**: The paper evaluates A$^2$MIM through extensive experiments on multiple popular benchmark datasets. The results show that A$^2$MIM learns better representations and improves the model's transferability across various downstream tasks.

In summary, the paper aims to close the gap in MIM applicability between network architectures and to improve MIM so that CNNs can also benefit from this self-supervised pre-training approach.
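The two mechanisms named in the summary, masking a feature map and penalizing reconstruction error in the frequency domain, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the constant fill value (standing in for a learnable mask token), the naive discrete Fourier transform, and the mean-absolute spectral difference are all assumptions chosen for readability. A real pipeline would use a tensor library and its FFT routines.

```python
import cmath
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=0):
    """Return a grid_h x grid_w grid of booleans; True marks a masked position."""
    rng = random.Random(seed)
    n_total = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_total))
    masked = set(rng.sample(range(n_total), n_masked))
    return [[(r * grid_w + c) in masked for c in range(grid_w)]
            for r in range(grid_h)]


def apply_mask(feature_map, mask, mask_value=0.0):
    """Replace masked positions with a fill value.

    Assumption: a constant stands in for a learnable mask token.
    """
    return [[mask_value if mask[r][c] else feature_map[r][c]
             for c in range(len(feature_map[0]))]
            for r in range(len(feature_map))]


def dft2(x):
    """Naive 2D discrete Fourier transform of a small real-valued grid."""
    h, w = len(x), len(x[0])
    return [[sum(x[r][c] * cmath.exp(-2j * cmath.pi * (u * r / h + v * c / w))
                 for r in range(h) for c in range(w))
             for v in range(w)]
            for u in range(h)]


def frequency_loss(pred, target):
    """Mean absolute difference between the 2D spectra of pred and target.

    Assumption: an illustrative choice, not the loss defined in the paper.
    """
    fp, ft = dft2(pred), dft2(target)
    h, w = len(pred), len(pred[0])
    return sum(abs(fp[u][v] - ft[u][v])
               for u in range(h) for v in range(w)) / (h * w)


# Toy 4x4 "feature map" with half of its positions masked.
target = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
mask = random_patch_mask(4, 4, mask_ratio=0.5)
masked = apply_mask(target, mask)

loss_identical = frequency_loss(target, target)  # identical inputs give zero loss
loss_masked = frequency_loss(masked, target)     # masking changes the spectrum
```

Because each Fourier coefficient depends on every spatial position, a loss of this kind compares the prediction and the target globally rather than position by position. That property gives some intuition for why a frequency-domain term might encourage interactions among patches.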