Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper explores how to use Masked Autoencoders (MAE) effectively for self-supervised learning in computer vision. Specifically:

1. **Addressing the Differences Between Vision and Language**:
   - In Natural Language Processing (NLP), self-supervised pre-training has been highly successful, through masked language models (e.g., BERT) and autoregressive models (e.g., GPT). In computer vision, similar ideas have progressed more slowly.
   - The authors analyze the differences between vision and language and design the method around them to make self-supervised learning in computer vision more effective.
2. **Efficient Large-Scale Model Training**:
   - The paper proposes an asymmetric encoder-decoder architecture: the encoder processes only the visible image patches (no mask tokens), and a lightweight decoder reconstructs the missing patches from the latent representation and mask tokens. This makes it practical to train large models and reduces pre-training time (by 3x or more) and memory consumption.
   - Masking a high proportion of the input patches (e.g., 75%) creates a challenging self-supervised task, which helps the model learn more useful feature representations.
3. **Choosing the Reconstruction Target**:
   - Using per-patch normalized pixel values as the reconstruction target improves the quality of the learned representations compared with reconstructing unnormalized pixels.
4. **Comparison with Existing Methods**:
   - Compared with existing self-supervised learning methods (e.g., DINO, MoCo v3, BEiT), MAE performs better across model sizes. A vanilla ViT-Huge model reaches the best accuracy (87.8%) among methods that use only ImageNet-1K data.
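The random masking step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `random_masking`, the 224×224 image size, and the 16×16 patch size are assumptions chosen for the example, and string placeholders stand in for real patch embeddings.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets, uniformly at random.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Keep at least one patch so the encoder always has some input.
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    order = list(range(num_patches))
    rng.shuffle(order)  # uniform random permutation of patch indices
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# Assumed sizes: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
visible, masked = random_masking(num_patches, mask_ratio=0.75, seed=0)

# Placeholder "patches"; in a real model these would be patch embeddings.
patches = [f"patch_{i}" for i in range(num_patches)]

# The encoder receives only the visible patches, with no mask tokens.
encoder_input = [patches[i] for i in visible]
print(len(encoder_input), len(masked))  # 49 147
```

With a 75% mask ratio the encoder sees only 49 of the 196 patches in this setup, which is where the savings in pre-training time and memory come from. The decoder, not shown here, would receive the encoded visible patches together with mask tokens at the 147 masked positions.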
In summary, the paper proposes a simple, efficient, and scalable masked autoencoder method for self-supervised learning in computer vision and demonstrates strong performance on image classification and downstream transfer tasks.
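The normalized-pixel reconstruction target can be illustrated with a short sketch. It assumes each patch is a flat list of pixel values, that normalization uses the patch's own mean and (population) variance, and that the mean squared error is averaged over masked patches only. The names `normalize_patch` and `masked_mse` and the `eps` value are hypothetical and chosen for this example.

```python
import math


def normalize_patch(pixels, eps=1e-6):
    """Normalize one patch by its own mean and standard deviation."""
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)  # population variance (assumption)
    std = math.sqrt(var + eps)  # eps avoids division by zero on flat patches
    return [(p - mean) / std for p in pixels]


def masked_mse(pred_patches, target_patches, masked_idx, normalize=True):
    """Mean squared error over the pixels of masked patches only."""
    if not masked_idx:
        raise ValueError("masked_idx must contain at least one patch index")
    total, count = 0.0, 0
    for i in masked_idx:
        target = target_patches[i]
        if normalize:
            target = normalize_patch(target)
        total += sum((p - t) ** 2 for p, t in zip(pred_patches[i], target))
        count += len(target)
    return total / count


# Toy data: three patches of four pixels each; patches 0 and 2 are masked.
targets = [[0.0, 2.0, 4.0, 6.0], [10.0, 10.0, 10.0, 10.0], [1.0, 3.0, 1.0, 3.0]]
masked_idx = [0, 2]

# A perfect prediction reproduces the normalized targets, so its loss is zero.
perfect = [normalize_patch(p) for p in targets]
zeros = [[0.0] * 4 for _ in targets]
print(masked_mse(perfect, targets, masked_idx))
print(masked_mse(zeros, targets, masked_idx))
```

Because each patch is rescaled to zero mean and roughly unit variance, the target emphasizes local contrast within a patch rather than its absolute brightness. Visible patches (patch 1 here) do not contribute to the loss.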

Masked Autoencoders Are Scalable Vision Learners

Masked Autoencoders for Point Cloud Self-supervised Learning

Masked Autoencoders are Efficient Class Incremental Learners

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

Masked Autoencoders As Image Processors

SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Masked Autoencoders As Spatiotemporal Learners

VideoMAC: Video Masked Autoencoders Meet ConvNets

Masked autoencoders are effective solution to transformer data-hungry

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

SdAE: Self-distillated Masked Autoencoder

Understanding Masked Autoencoders From a Local Contrastive Perspective

Improving Masked Autoencoders by Learning Where to Mask

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Masked Autoencoders Are Robust Neural Architecture Search Learners

Contrastive Masked Autoencoders are Stronger Vision Learners

How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders

Rethinking Patch Dependence for Masked Autoencoders

Teaching Masked Autoencoder With Strong Augmentations