Understanding Masked Autoencoders From a Local Contrastive Perspective

Xiaoyu Yue, Lei Bai, Meng Wei, Jiangmiao Pang, Xihui Liu, Luping Zhou, Wanli Ouyang

2023-12-08

Abstract: Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies. However, despite achieving state-of-the-art performance across various downstream vision tasks, the underlying mechanisms that drive MAE's efficacy are less well explored than those of the canonical contrastive learning paradigm. In this paper, we first propose a local perspective to explicitly extract a local contrastive form from MAE's reconstructive objective at the patch level. We then introduce a new empirical framework, called Local Contrastive MAE (LC-MAE), to analyze both reconstructive and contrastive aspects of MAE. LC-MAE reveals that MAE learns invariance to random masking and ensures distribution consistency between the learned token embeddings and the original images. Furthermore, we dissect the contribution of the decoder and random masking to MAE's success, revealing both the decoder's learning mechanism and the dual role of random masking as data augmentation and effective receptive field restriction. Our experimental analysis sheds light on the intricacies of MAE and summarizes some useful design methodologies, which can inspire more powerful visual self-supervised methods.
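For orientation, the reconstructive objective mentioned in the abstract can be written in the form commonly used for MAE. This is a sketch of the standard masked-patch reconstruction loss, not the local contrastive form the paper derives from it. The symbols are assumptions introduced here: $\mathcal{M}$ is the set of masked patch indices, $x_i$ the pixel values of patch $i$, and $\hat{x}_i$ the decoder's prediction for that patch. The expression holds up to a constant normalization over the pixels in each patch.

```latex
\mathcal{L}_{\mathrm{rec}}
  = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}}
    \left\lVert \hat{x}_i - x_i \right\rVert_2^2
```

Because the sum runs over individual patches, the objective decomposes at the patch level, which is the granularity at which the abstract says the local contrastive form is extracted.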

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to explain how the Masked AutoEncoder (MAE) works as a self-supervised learning method. Although MAE achieves state-of-the-art performance on various downstream vision tasks, its underlying mechanisms have been studied less thoroughly than those of the classic contrastive learning paradigm. The paper introduces a local perspective that explicitly extracts a local contrastive form from MAE's reconstructive objective at the patch level, and proposes a new empirical framework, Local Contrastive MAE (LC-MAE), to analyze both the reconstructive and contrastive aspects of MAE. Specifically, the paper explores the following points:

1. **How MAE learns invariance to random masking**: Through the local contrastive form, MAE keeps local features consistent under different random masks.
2. **How MAE ensures distribution consistency between the learned token embeddings and the original image**: This helps prevent model collapse.
3. **The role of the decoder**: The paper analyzes the decoder's learning mechanism and finds that its shallow layers mainly use positional information, while its deeper layers gradually learn semantic information.
4. **The dual role of random masking**: Random masking acts both as data augmentation and as a restriction on the effective receptive field, which is crucial for MAE's performance on downstream tasks.

Through these analyses, the paper identifies the factors behind MAE's success and offers insights and design guidelines for future research.
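To make the masking and reconstruction steps concrete, the following is a minimal sketch in plain Python, not the paper's LC-MAE code. It assumes the 75% masking ratio and the masked-patches-only loss of the original MAE. The helper names `random_patch_mask` and `masked_mse` are hypothetical, and the "image" is a toy list of flattened patches.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) using one random mask.

    Hypothetical helper: the 0.75 default follows the original MAE setup.
    """
    rng = random.Random(seed)
    perm = list(range(num_patches))
    rng.shuffle(perm)
    num_masked = int(num_patches * mask_ratio)
    masked_idx = sorted(perm[:num_masked])
    visible_idx = sorted(perm[num_masked:])
    return visible_idx, masked_idx


def masked_mse(target, prediction, masked_idx):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count


# Toy image: a 4x4 grid of patches, each flattened to 4 pixel values.
patches = [[float(i)] * 4 for i in range(16)]

# Two different random masks of the same image, as seen in different iterations.
visible_a, masked_a = random_patch_mask(16, seed=0)
visible_b, masked_b = random_patch_mask(16, seed=1)

# A stand-in "decoder output" that predicts zeros for every patch.
prediction = [[0.0] * 4 for _ in range(16)]
loss = masked_mse(patches, prediction, masked_a)
```

Each call to `random_patch_mask` yields a different visible subset of the same image, which is the sense in which random masking can act as data augmentation. The loss depends only on the patches the encoder never saw.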
