How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining

Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang
2024-06-05
Abstract: Masked reconstruction, which predicts randomly masked patches from unmasked ones, has emerged as an important approach in self-supervised pretraining. However, the theoretical understanding of masked pretraining is rather limited, especially for the foundational architecture of transformers. In this paper, to the best of our knowledge, we provide the first end-to-end theoretical guarantee of learning one-layer transformers in masked reconstruction self-supervised pretraining. On the conceptual side, we posit a mechanism of how transformers trained with masked vision pretraining objectives produce empirically observed local and diverse attention patterns, on data distributions with spatial structures that highlight feature-position correlations. On the technical side, our end-to-end characterization of training dynamics in softmax-attention models simultaneously accounts for input and position embeddings, which is developed based on a careful analysis tracking the interplay between feature-wise and position-wise attention correlations.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
The paper asks how to explain, theoretically, why Transformers learn diverse attention patterns during masked vision pretraining. Specifically, the authors focus on:

1. **Theoretical Solution**: How to characterize the solutions that Transformers converge to during masked vision pretraining, and how to show that these solutions yield non-collapsing, diverse local attention patterns.
2. **Mechanism of Learning Attention Patterns**: How Transformers come to learn diverse local attention patterns during pretraining rather than attending globally to the main object.

### Background of the Paper

Self-supervised learning has achieved significant success in natural language processing (NLP), with models such as BERT and GPT. In computer vision, self-supervised learning initially focused mainly on discriminative methods, such as contrastive and non-contrastive learning. Inspired by masked language models and the success of Vision Transformers (ViTs), generative methods such as masked reconstruction have since become increasingly popular in self-supervised vision pretraining.

### Research Motivation

Despite a large number of empirical studies of masked vision pretraining, its theoretical understanding remains very limited. Most existing theoretical work focuses on discriminative methods such as contrastive learning. Theoretical work on transformer-based masked image modeling is scarce, which leaves an important gap.

### Main Contributions

1. **Global Convergence Guarantee**: The authors provide a global convergence guarantee for the masked reconstruction loss and characterize how attention is distributed at convergence, thereby proving that masked pretraining can learn diverse local attention patterns.
2. **Training Dynamics Analysis of Attention Correlations**: The authors analyze the training dynamics of attention correlations and show that Transformers capture the desired diverse local patterns by learning feature-position correlations, regardless of whether the features are global or local.
3. **Attention Diversity Metric**: The authors design a new empirical metric, the Attention Diversity Metric, to probe Vision Transformers trained by different methods. The experimental results further confirm that masked image modeling learns diverse local patterns.

### Comparison with Previous Work

- **[JSL22]**: First characterized the training dynamics of Transformers in supervised learning, but was limited to simple visual data distributions and assumed only position-position correlations, which is unrealistic for actual visual datasets.
- **[PZS22]**: Analyzed the feature learning process of MAE using CNN architectures, but did not capture recent findings on self-attention, namely that masked pretraining can learn diverse local patterns.

### Conclusion

Through theoretical analysis and experimental validation, the paper identifies a mechanism by which Transformers learn diverse local attention patterns during masked vision pretraining, addressing a gap in existing theoretical research. The result deepens the understanding of Transformer behavior in self-supervised learning and provides a theoretical foundation for future research.
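
The masked reconstruction objective described in the abstract (predict masked patches from unmasked ones with a one-layer softmax-attention model that sees both input and position embeddings) can be illustrated with a small sketch. This is not the paper's exact model: the zeroing of masked patches, the concatenation of patch content with a one-hot position embedding, the single attention matrix `Q`, the use of the visible patch content as values, and all function names are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def attention_weights(tokens, Q):
    """Row p holds softmax over q of tokens[p]^T Q tokens[q]."""
    projected = [matvec(Q, t) for t in tokens]  # Q t_q for every key q
    return [softmax([dot(t_p, Qt_q) for Qt_q in projected]) for t_p in tokens]


def masked_reconstruction_loss(patches, pos_emb, Q, masked):
    """Mean squared reconstruction error over the masked patches.

    Assumption: a masked patch has its content zeroed but keeps its
    position embedding, so attention from a masked query can rely only
    on position information.
    """
    d = len(patches[0])
    visible = [[0.0] * d if p in masked else list(x)
               for p, x in enumerate(patches)]
    # Token = patch content concatenated with its position embedding.
    tokens = [x + e for x, e in zip(visible, pos_emb)]
    attn = attention_weights(tokens, Q)
    loss = 0.0
    for p in masked:
        # Reconstruction is an attention-weighted average of visible content.
        recon = [sum(attn[p][q] * visible[q][k] for q in range(len(patches)))
                 for k in range(d)]
        loss += sum((a - b) ** 2 for a, b in zip(patches[p], recon))
    return loss / len(masked), attn


random.seed(0)
P, d = 6, 4  # number of patches, patch dimension (toy values)
patches = [[random.gauss(0, 1) for _ in range(d)] for _ in range(P)]
pos_emb = [[1.0 if i == p else 0.0 for i in range(P)] for p in range(P)]
Q = [[0.0] * (d + P) for _ in range(d + P)]  # zero init: uniform attention
masked = set(random.sample(range(P), 3))
loss, attn = masked_reconstruction_loss(patches, pos_emb, Q, masked)
```

With `Q` initialized at zero, every query attends uniformly to all patches. In this sketch, training would mean adjusting `Q` by gradient descent on `loss`; the blocks of `Q` that pair content with position entries are where feature-position correlations of the kind the paper analyzes could be encoded.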