On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen
2024-11-15
Abstract: Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher also further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Neural and Evolutionary Computing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper "On the Surprising Effectiveness of Attention Transfer for Vision Transformers" asks whether pre-training Vision Transformers (ViT) really improves downstream performance by learning useful features. The authors hypothesize that the main role of pre-training may instead be to teach the model how information should flow between tokens.

To test this hypothesis, the paper introduces a method called **Attention Transfer**. It transfers only the attention patterns (i.e., how information flows between tokens) from a pre-trained teacher to a student model, which must learn its own features from scratch. The experiments show that a student guided only by pre-trained attention patterns can learn high-quality features and reach downstream performance comparable to full fine-tuning.

### Main Contributions

1. **Detailed Analysis of the Sufficiency of Attention Patterns**:
   - Using only pre-trained attention patterns is often sufficient to reach downstream performance comparable to fine-tuning on ImageNet-1K.
   - Ensembling an attention-transfer student with the fine-tuned teacher further improves ImageNet accuracy.
   - These findings challenge the common view that pre-training is primarily for feature learning.
2. **Attention Transfer Methods**:
   - Two attention transfer methods are introduced: **Attention Copy** and **Attention Distillation**.
   - Attention Copy directly "copy-pastes" the attention maps of the pre-trained teacher into the student, while Attention Distillation trains the student to reproduce the teacher's attention maps.
   - Together, these methods help separate the roles of learned features and attention patterns in pre-training.
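The two transfer variants described above can be sketched for a single attention head as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the toy shapes, and the choice of cross-entropy as the distillation loss are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(x, w_q, w_k):
    # Attention map of shape (tokens, tokens) for one head.
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def attention_copy_layer(x_student, teacher_attn, w_v):
    # Attention Copy (sketch): the student computes its own values,
    # but tokens are mixed using the teacher's attention map.
    return teacher_attn @ (x_student @ w_v)

def attention_distill_loss(student_attn, teacher_attn, eps=1e-9):
    # Attention Distillation (sketch): cross-entropy between teacher and
    # student attention rows, averaged over query tokens.
    # The specific loss is an assumption of this example.
    ce = -(teacher_attn * np.log(student_attn + eps)).sum(axis=-1)
    return float(ce.mean())

# Toy hidden states and weights (hypothetical sizes).
rng = np.random.default_rng(0)
n_tokens, dim = 4, 8
x_teacher = rng.normal(size=(n_tokens, dim))
x_student = rng.normal(size=(n_tokens, dim))
t_q, t_k = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
s_q, s_k, s_v = (rng.normal(size=(dim, dim)) for _ in range(3))

teacher_attn = attention_map(x_teacher, t_q, t_k)
student_attn = attention_map(x_student, s_q, s_k)

copied_out = attention_copy_layer(x_student, teacher_attn, s_v)
loss = attention_distill_loss(student_attn, teacher_attn)
```

In the copy variant the student has no query or key computation of its own at the copied layer and needs the teacher at inference time. In the distillation variant the student keeps its own queries and keys, and the loss above would be added to the task loss during training, so the teacher is only needed while training.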
### Experimental Results

- **Attention Copy**: On ImageNet-1K classification, Attention Copy closes most of the gap between training from scratch and full fine-tuning, reaching 85.1% accuracy.
- **Attention Distillation**: Attention Distillation fully matches full fine-tuning, reaching 85.7% accuracy.
- **Ensembling**: Ensembling the Attention Distillation model with the fine-tuned model raises accuracy to 86.3%, 0.6 percentage points higher than the fine-tuned model alone.

### Analysis and Discussion

- **Importance of Different Activations, Layers, and Heads**: The study examines variants of attention transfer, including transferring only a subset of Q, K, or V, and transferring only certain layers or heads. Transferring more layers and heads is generally more beneficial, but performance saturates at 12 heads.
- **Re-learning the Teacher Model**: CKA similarity and ensemble accuracy both indicate that the attention-transfer model and the fine-tuned model learn significantly different representations, so the student is not simply re-learning the teacher's representations.

### Conclusion

By introducing attention transfer, this paper challenges the traditional view that pre-training is primarily for feature learning. Pre-trained attention patterns alone are sufficient to guide a new model toward high-quality features and good downstream performance. This offers an alternative way to use pre-trained models, especially in scenarios where sharing weights poses security risks.
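The CKA comparison mentioned in the analysis can be illustrated with a short sketch. It assumes the linear variant of CKA and uses random toy features; the summary does not say which CKA variant or which layers the paper uses, so the function name and data here are purely illustrative.

```python
import numpy as np

def linear_cka(x, y):
    # Linear CKA between two feature matrices of shape (samples, features).
    # Columns are centered first; the score lies in [0, 1].
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(64, 16))

# The same features after a random rotation and rescaling:
# linear CKA treats these as the same representation.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
feats_rotated = 3.0 * feats_a @ rotation

# Independent random features: CKA should be clearly lower.
feats_unrelated = rng.normal(size=(64, 16))

cka_same = linear_cka(feats_a, feats_rotated)
cka_diff = linear_cka(feats_a, feats_unrelated)
```

Because the score ignores rotations and uniform rescaling of the feature space, a low value between two models suggests genuinely different representations rather than the same features in a different basis.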