Abstract: Deep supervised learning models require a high volume of labeled data to attain sufficiently good results. However, gathering and annotating such large datasets is costly and laborious. Recently, the application of self-supervised learning (SSL) in vision tasks has gained significant attention. The intuition behind SSL is to exploit the synchronous relationships within the data as a form of self-supervision, which can be versatile. In the current big data era, most data is unlabeled, and the success of SSL thus relies on finding ways to utilize this vast amount of unlabeled data. It is therefore better for deep learning algorithms to reduce reliance on human supervision and instead focus on self-supervision based on the inherent relationships within the data. With the advent of Vision Transformers (ViTs), which have achieved remarkable results in computer vision, it is crucial to explore and understand the various SSL mechanisms employed for training these models, specifically in scenarios where limited labeled data is available. In this survey, we develop a comprehensive taxonomy for systematically classifying SSL techniques based on their representations and the pre-training tasks applied. Additionally, we discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field. Furthermore, we present a comparative analysis of different SSL methods, evaluate their strengths and limitations, and identify potential avenues for future research.

### What problem does this paper attempt to address?

This paper surveys the mechanisms of self-supervised learning (SSL) as applied to vision transformers (ViTs). It focuses on the following aspects:

1. **Reducing the dependence on labeled data**: Deep supervised learning models require a large amount of labeled data to achieve good results, and obtaining and labeling this data is costly and time-consuming. Self-supervised learning lets researchers exploit large amounts of unlabeled data and so reduce the dependence on manual labels.
2. **Improving the performance of ViTs**: Vision transformers (ViTs) perform well in computer vision tasks, especially in capturing global context and long-range dependencies. However, ViTs need a large amount of data to learn effective representations. With self-supervised learning, large-scale unlabeled data can be used for pre-training, which improves the performance of ViTs on downstream tasks.
3. **Exploring different mechanisms of SSL**: The paper systematically classifies and discusses various self-supervised learning techniques, including contrastive learning, generative learning, clustering, knowledge distillation, and hybrid SSL methods. Through this classification, the authors aim to give researchers a comprehensive framework for understanding the advantages and disadvantages of different SSL methods and their applications in ViTs.
4. **Coping with real-world challenges**: In practical applications, data is often limited and imbalanced. The paper explores how self-supervised learning can be used in these situations, especially when labeled data is scarce or expensive, to improve the robustness and generalization ability of the model.
5. **Future research directions**: The paper also points out the challenges currently faced in the field of self-supervised learning and proposes future research directions, such as improving existing SSL methods, developing new pre-training tasks, and exploring multi-modal self-supervised learning.

In summary, the goal of this paper is to provide researchers with a comprehensive reference framework by systematically reviewing and analyzing the application of self-supervised learning in vision transformers, helping them make better use of unlabeled data in computer vision tasks and improve model performance.

A Survey of the Self Supervised Learning Mechanisms for Vision Transformers

Self-supervised Learning on Graphs: Contrastive, Generative, or Predictive

Self-supervised visual learning in the low-data regime: a comparative evaluation

A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends

Semi-Supervised and Unsupervised Deep Visual Learning: A Survey

A Survey of Self-Supervised Learning from Multiple Perspectives: Algorithms, Theory, Applications and Future Trends

Consequential Advancements of Self-Supervised Learning (SSL) in Deep Learning Contexts

Semi-supervised Vision Transformers at Scale

Self-Supervised Learning for Real-World Object Detection: a Survey

Self-supervised Learning: A Succinct Review

Progress and Thinking on Self-Supervised Learning Methods in Computer Vision: A Review

In-Domain Self-Supervised Learning Improves Remote Sensing Image Scene Classification

Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey

Self-Supervised Learning in Remote Sensing: A review

Self-Supervised Learning for Time Series Analysis: Taxonomy, Progress, and Prospects

Self-Supervised Learning on MeerKAT Wide-Field Continuum Images

Probing the Mid-level Vision Capabilities of Self-Supervised Learning

Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders

Self-supervised visual learning from interactions with objects

Computer Vision Self-supervised Learning Methods on Time Series