Abstract: Self-supervised learning has become a cornerstone in computer vision, primarily divided into reconstruction-based methods like masked autoencoders (MAE) and discriminative methods such as contrastive learning (CL). Recent empirical observations reveal that MAE and CL capture different types of representations: CL tends to focus on global patterns, while MAE adeptly captures both global and subtle local information simultaneously. Despite a flurry of recent empirical investigations to shed light on this difference, theoretical understanding remains limited, especially on the dominant architecture vision transformers (ViTs). In this paper, to provide rigorous insights, we model the visual data distribution by considering two types of spatial features: dominant global features and comparatively minuscule local features, and study the impact of imbalance among these features. We analyze the training dynamics of one-layer softmax-based ViTs on both MAE and CL objectives using gradient descent. Our analysis shows that as the degree of feature imbalance varies, ViTs trained with the MAE objective effectively learn both global and local features to achieve near-optimal reconstruction, while the CL-trained ViTs favor predominantly global features, even under mild imbalance. These results provide a theoretical explanation for distinct behaviors of MAE and CL observed in empirical studies.

What problem does this paper attempt to address?

The paper sets out to explain theoretically why the two mainstream self-supervised learning (SSL) methods, contrastive learning (CL) and masked autoencoders (MAE), behave differently when used to train Vision Transformers (ViTs). Specifically, it focuses on the following points:

1. **Understanding the learning mechanisms of different SSL methods**:
   - The paper aims to explain theoretically why MAE can learn diverse attention patterns while CL tends to focus on global features.
   - It models the visual data distribution with two types of spatial features, dominant global features and comparatively small local features, and studies how the imbalance between them affects training.
2. **Providing theoretical support**:
   - The paper gives rigorous mathematical proofs showing that, as the degree of feature imbalance varies, ViTs trained with the MAE objective learn both global and local features and achieve near-optimal reconstruction, while ViTs trained with the CL objective favor predominantly global features, even under mild imbalance.
3. **Filling the theoretical gap**:
   - Many empirical studies have revealed that MAE and CL behave differently in visual pre-training, but theoretical understanding remains limited, especially for the ViT architecture.
   - By introducing concepts such as the information gap, the paper provides a theoretical explanation for these empirical observations.
4. **Contributions and innovation points**:
   - It provides guarantees on the global convergence of ViTs under the MAE and CL loss functions, which is the first end-to-end guarantee for training ViTs with self-supervised learning objectives.
   - It details the training dynamics of attention correlations and reveals how MAE and CL produce different attention patterns during training.

In summary, the core problem of this paper is to explain, through theoretical analysis backed by rigorous proofs, why MAE and CL behave differently when training Vision Transformers in a self-supervised manner. This deepens the understanding of how the two methods work and provides a theoretical basis for future research.
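To make the setting above more concrete, the following is a minimal sketch of a one-layer softmax attention model with a masked-reconstruction (MAE-style) loss and a contrastive (InfoNCE-style) loss. It is an illustration only, not the paper's formal setup: the identity value map, zero-filled masking without positional encodings, the cosine-similarity InfoNCE form, and all names (`attention_layer`, `mae_loss`, `info_nce`, the toy `patches` and `Q`) are assumptions introduced here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(patches, Q):
    """One softmax attention layer with an identity value map (assumption).

    The score between patches p and q is x_p^T Q x_q; Q plays the role of
    the attention-correlation matrix. Returns (outputs, attention weights).
    """
    d = len(patches[0])
    outputs, attn = [], []
    for x_p in patches:
        scores = [dot(x_p, [dot(row, x_q) for row in Q]) for x_q in patches]
        weights = softmax(scores)
        attn.append(weights)
        outputs.append(
            [sum(w * x_q[k] for w, x_q in zip(weights, patches)) for k in range(d)]
        )
    return outputs, attn

def mae_loss(patches, masked_idx, Q):
    """Mean squared reconstruction error on the masked patches.

    Masked patches are replaced by zero vectors (assumption; no positional
    encodings are modeled in this sketch).
    """
    d = len(patches[0])
    masked = set(masked_idx)
    visible = [[0.0] * d if i in masked else list(x) for i, x in enumerate(patches)]
    outputs, _ = attention_layer(visible, Q)
    errors = [
        sum((o - t) ** 2 for o, t in zip(outputs[i], patches[i])) for i in masked_idx
    ]
    return sum(errors) / len(errors)

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-12)

def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss: negative log-probability of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])

# Toy image: two patches carrying a strong "global" feature and one patch
# carrying a much weaker "local" feature (hypothetical values).
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.2]]
Q = [[1.0, 0.0], [0.0, 1.0]]
print("MAE-style loss, local patch masked:", mae_loss(patches, [2], Q))
print("InfoNCE-style loss:", info_nce([1.0, 0.0], [1.0, 0.1], [[0.0, 1.0]]))
```

In this toy setup the reconstruction loss is computed on the masked patch itself, so the weak local feature contributes to the objective directly, whereas the contrastive loss only compares whole-view representations. The sketch does not reproduce the paper's training dynamics or its convergence guarantees.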

A Theoretical Analysis of Self-Supervised Learning for Vision Transformers
