An Empirical Study of Training Self-Supervised Vision Transformers

Xinlei Chen, Saining Xie, Kaiming He
2021-08-17
Abstract: This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failure, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper examines the problems that arise when training Vision Transformers (ViT) in a self-supervised learning framework, focusing on training instability and its effect on model accuracy. The core question is how to train self-supervised ViT models effectively. Specifically, the researchers address the following aspects:

1. **Self-Supervised Learning and ViT**: Training recipes for convolutional neural networks are mature and robust, but recipes for ViT are not yet established, especially in the self-supervised setting, where no labeled data is used. The researchers therefore study how basic training components (such as batch size, learning rate, and optimizer) affect self-supervised ViT training.
2. **Instability Issues**: The study finds that instability during training is a major issue for self-supervised ViT. It does not necessarily cause training to fail outright; instead, it can cause a mild drop in accuracy (e.g., 1%-3%). A decline of this size is easy to miss unless a more stable training baseline is available for comparison.
3. **Techniques to Improve Stability**: To mitigate the instability, the researchers use a simple technique: freezing the patch projection layer in ViT, that is, using a fixed random patch projection that is not learned. Experiments show that this improves training stability and raises final accuracy across the scenarios tested.
4. **Comparison with Other Frameworks**: Besides MoCo v3, the researchers examine several other popular self-supervised learning frameworks (SimCLR, BYOL, and SwAV) and observe similar instability. Applying the same technique improves stability and accuracy in these frameworks as well.

In summary, the paper shows that instability is a real but easily overlooked problem when training ViT with self-supervised learning, and that a simple fix, freezing the patch projection layer, makes training more stable and improves accuracy. These results serve as a baseline and a set of data points for future work on self-supervised learning in computer vision.
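The idea of a fixed random patch projection can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation: the function names (`patchify`, `make_params`, `embed`, `sgd_step`), the `FROZEN` set, the toy `head` parameter, and the initialization scale are all assumptions made for illustration. The sketch shows only the mechanism: the projection matrix is initialized randomly, used in the forward pass, and skipped in the update step. In an automatic-differentiation framework, the same effect would typically be obtained by stopping gradients after this layer or disabling gradients on its weights.

```python
import random

# Names of parameters that are initialized randomly and never updated
# (hypothetical naming, chosen for this sketch).
FROZEN = {"patch_proj"}


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(patch) for j in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]


def make_params(patch, dim, seed=0):
    """Create a random patch projection and a toy trainable vector."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(patch * patch)]
    head = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    return {"patch_proj": proj, "head": head}


def embed(image, params, patch):
    """Project each flattened patch to a dim-sized token with the fixed matrix."""
    proj = params["patch_proj"]
    dim = len(proj[0])
    return [
        [sum(x * proj[k][d] for k, x in enumerate(vec)) for d in range(dim)]
        for vec in patchify(image, patch)
    ]


def sgd_step(params, grads, lr):
    """Plain SGD update that leaves every parameter in FROZEN untouched."""
    for name, grad in grads.items():
        if name in FROZEN:
            continue  # frozen: gradient is ignored, weights stay at random init
        params[name] = [p - lr * g for p, g in zip(params[name], grad)]
```

For a 4 x 4 image and 2 x 2 patches, `embed` returns four tokens. Calling `sgd_step` with gradients for both entries changes `head` but leaves `patch_proj` exactly as initialized.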