A Theoretical Analysis of Self-Supervised Learning for Vision Transformers

Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang
2025-02-05
Abstract: Self-supervised learning has become a cornerstone in computer vision, primarily divided into reconstruction-based methods like masked autoencoders (MAE) and discriminative methods such as contrastive learning (CL). Recent empirical observations reveal that MAE and CL capture different types of representations: CL tends to focus on global patterns, while MAE adeptly captures both global and subtle local information simultaneously. Despite a flurry of recent empirical investigations to shed light on this difference, theoretical understanding remains limited, especially on the dominant architecture vision transformers (ViTs). In this paper, to provide rigorous insights, we model the visual data distribution by considering two types of spatial features: dominant global features and comparatively minuscule local features, and study the impact of imbalance among these features. We analyze the training dynamics of one-layer softmax-based ViTs on both MAE and CL objectives using gradient descent. Our analysis shows that as the degree of feature imbalance varies, ViTs trained with the MAE objective effectively learn both global and local features to achieve near-optimal reconstruction, while the CL-trained ViTs favor predominantly global features, even under mild imbalance. These results provide a theoretical explanation for distinct behaviors of MAE and CL observed in empirical studies.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
The paper sets out to explain theoretically why self-supervised learning (SSL) methods behave differently in vision transformers (ViTs), focusing on the two mainstream approaches: contrastive learning (CL) and masked autoencoders (MAE). Specifically, it addresses the following points:

1. **Understanding the learning mechanisms of different SSL methods**:
   - The paper aims to explain theoretically why MAE can learn diverse attention patterns while CL tends to focus on global features.
   - It models the visual data distribution with two types of spatial features, dominant global features and comparatively small local features, and studies how the imbalance between them affects training.
2. **Providing theoretical support**:
   - The paper gives rigorous proofs that, as the degree of feature imbalance varies, ViTs trained with the MAE objective effectively learn both global and local features and achieve near-optimal reconstruction, while ViTs trained with the CL objective predominantly learn global features, even under mild imbalance.
3. **Filling the theoretical gap**:
   - Many empirical studies have shown that MAE and CL behave differently in visual pre-training, but theoretical understanding remains limited, especially for the ViT architecture.
   - By introducing concepts such as the information gap, the paper provides theoretical explanations for these empirical observations.
4. **Contributions and innovations**:
   - It provides global convergence guarantees for ViTs under the MAE and CL loss functions, the first end-to-end guarantees for training ViTs with self-supervised learning objectives.
   - It characterizes the training dynamics of attention correlations and shows how MAE and CL produce different attention patterns during training.

In summary, the paper's core goal is to explain, through theoretical analysis backed by rigorous proofs, why MAE and CL behave differently when used to pre-train vision transformers. This clarifies how the two methods work and provides a theoretical basis for future research.
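
To make the two objectives discussed above concrete, here is a minimal NumPy sketch of a one-layer, single-head softmax attention model evaluated under an MAE-style reconstruction loss and a CL-style (InfoNCE) loss. It is illustrative only and is not the paper's exact formulation: the positional embeddings `E`, the single attention matrix `Q`, zero-filling of masked patches, mean pooling, cosine similarity, and the temperature `tau` are all assumptions made for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(tokens, Q):
    # One-layer, single-head softmax attention; rows of `tokens` are patches.
    return softmax(tokens @ Q @ tokens.T, axis=-1)

def mae_loss(X, E, Q, mask):
    # MAE-style objective (assumed form): hide masked patches, attend using
    # content plus positional embeddings, reconstruct only the hidden patches.
    X_in = np.where(mask[:, None], 0.0, X)   # masked patches are zeroed out
    A = attention_weights(X_in + E, Q)       # (P, P) attention weights
    X_hat = A @ X_in                         # reconstruction from visible content
    return float(np.mean((X_hat[mask] - X[mask]) ** 2))

def embed(X, E, Q):
    # Image-level embedding (assumed form): mean-pooled attention output.
    A = attention_weights(X + E, Q)
    return (A @ X).mean(axis=0)

def cl_loss(views_a, views_b, E, Q, tau=0.5):
    # InfoNCE-style objective (assumed form): view i of batch A should match
    # view i of batch B and mismatch every other image in the batch.
    Za = np.stack([embed(x, E, Q) for x in views_a])
    Zb = np.stack([embed(x, E, Q) for x in views_b])
    Za = Za / (np.linalg.norm(Za, axis=1, keepdims=True) + 1e-12)
    Zb = Zb / (np.linalg.norm(Zb, axis=1, keepdims=True) + 1e-12)
    logits = Za @ Zb.T / tau
    m = logits.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    log_probs = logits - lse                 # row-wise log-softmax
    return float(-np.mean(np.diag(log_probs)))
```

In this sketch the MAE loss is computed per patch, so local content at each masked position enters the objective directly, whereas the CL loss only sees a pooled, image-level embedding. That contrast is meant as intuition for the summary above, not as a substitute for the paper's analysis.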