Abstract: This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failure, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.

What problem does this paper attempt to address?

The paper studies how to train Vision Transformers (ViT) effectively within a self-supervised learning framework, with a particular focus on training instability and its effect on model accuracy. Specifically, the researchers focus on the following aspects:

1. **Self-Supervised Learning and ViT**: Training recipes for standard convolutional networks are mature and robust, but recipes for ViT are not yet established, especially in self-supervised settings, where training is more challenging. The researchers therefore examine the effects of basic training components (such as batch size, learning rate, and optimizer) when training ViT with self-supervised learning.
2. **Instability Issues**: The study finds that instability during training is a major issue for self-supervised ViT. This instability may not cause training to fail outright, but it can cause a modest drop in accuracy (e.g., 1% to 3%). A drop of this size is easy to miss unless a more stable training baseline is available for comparison.
3. **Techniques to Improve Stability**: To mitigate the instability, the researchers propose a simple technique: freezing the patch projection layer in ViT, that is, using a fixed random patch projection. Experiments show that this improves training stability and the final model accuracy across different scenarios.
4. **Comparison with Other Frameworks**: Besides MoCo, the researchers also examine several other popular self-supervised learning frameworks (such as SimCLR, BYOL, and SwAV) and observe similar instability issues. Applying the same technique improves the stability of these frameworks and their accuracy.

In summary, the paper shows that instability is an easily overlooked source of accuracy loss when training ViT with self-supervised learning, and that freezing the patch projection layer is a simple and effective way to improve training stability. These findings are useful for further work on self-supervised learning in computer vision.
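
The idea of a frozen (fixed random) patch projection can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, and the names `make_patch_projection`, `embed_patches`, `sgd_step`, the toy `head` parameter, and all sizes are hypothetical choices made for illustration. It shows the core mechanism only: the projection is initialized randomly and then excluded from every parameter update, while the rest of the model keeps training.

```python
import random


def make_patch_projection(patch_dim, embed_dim, seed=0):
    """Create a random patch projection matrix (patch_dim x embed_dim).

    Assumption for illustration: Gaussian init scaled by 1/sqrt(patch_dim).
    """
    rng = random.Random(seed)
    scale = patch_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(embed_dim)]
            for _ in range(patch_dim)]


def embed_patches(patches, weight):
    """Project each flattened patch to an embedding (patches @ weight)."""
    embed_dim = len(weight[0])
    return [[sum(p[i] * weight[i][j] for i in range(len(p)))
             for j in range(embed_dim)]
            for p in patches]


def sgd_step(params, grads, frozen, lr=0.1):
    """Plain SGD update that skips every parameter named in `frozen`."""
    for name, value in params.items():
        if name in frozen:
            continue  # frozen parameters keep their initial random values
        grad = grads[name]
        for r in range(len(value)):
            for c in range(len(value[r])):
                value[r][c] -= lr * grad[r][c]


# Toy model: a patch projection plus one trainable matrix standing in for
# the rest of the network (hypothetical sizes).
params = {
    "patch_proj": make_patch_projection(patch_dim=4, embed_dim=3),
    "head": [[0.5, -0.5], [0.1, 0.2], [0.0, 1.0]],
}
frozen = {"patch_proj"}

before = [row[:] for row in params["patch_proj"]]
# Dummy gradients of all ones, just to exercise the update rule.
grads = {name: [[1.0] * len(row) for row in value]
         for name, value in params.items()}
sgd_step(params, grads, frozen, lr=0.1)

embeddings = embed_patches([[1.0, 0.0, 0.0, 0.0]], params["patch_proj"])
```

After the update, `params["patch_proj"]` is unchanged while `params["head"]` has moved. In a deep learning framework the same effect is usually obtained by disabling gradients for that layer's parameters rather than by filtering them in the optimizer step.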

An Empirical Study of Training Self-Supervised Vision Transformers

A Closer Look at Self-Supervised Lightweight Vision Transformers

Self-supervised Models are Good Teaching Assistants for Vision Transformers

Teaching Matters: Investigating the Role of Supervision in Vision Transformers

Effective Vision Transformer Training: A Data-Centric Perspective

What Do Self-Supervised Vision Transformers Learn?

Benchmarking Detection Transfer Learning with Vision Transformers

Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels

An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training

Training Object Detectors from Scratch: An Empirical Study in the Era of Vision Transformer

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

DeiT III: Revenge of the ViT

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Improve Vision Transformers Training by Suppressing Over-smoothing

Semi-supervised Vision Transformers at Scale

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

Training Vision Transformers with only 2040 Images

Improving Vision Transformers for Incremental Learning

How to Train Vision Transformer on Small-scale Datasets?

On the Surprising Effectiveness of Attention Transfer for Vision Transformers