Abstract: The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in both image and video domains. Different from the image domain, learning video representations is more challenging due to the temporal dimension, which brings in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domains. In this survey, we provide a review of existing approaches to self-supervised learning, focusing on the video domain. We summarize these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and the potential future directions in this area.

What problem does this paper attempt to address?

This paper addresses a central obstacle for deep learning in video understanding: labeled data is expensive and difficult to obtain. Although deep learning has achieved remarkable success in many fields, that success depends largely on the availability of large-scale labeled datasets. For video data, obtaining high-quality labels is especially costly and labor-intensive. In addition, human-generated labels can lead a model to learn biased representations, with poor generalization across domains and poor robustness. The paper therefore examines self-supervised learning (SSL) as an alternative that learns representations without labeled data. SSL removes the dependence on annotations and has shown promise in both the image and video domains. The paper focuses on SSL methods for video. Compared with images, video representation learning is more challenging because the temporal dimension introduces motion and other environmental dynamics. These challenges also create opportunities for video-specific ideas that advance SSL in the video and multimodal domains. The paper reviews existing methods and groups them into four categories by learning objective: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. It also introduces commonly used datasets and downstream evaluation tasks, discusses the limitations of existing work, and outlines potential future research directions.

Self-Supervised Learning for Videos: A Survey

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey

A Survey on Self-Supervised Representation Learning

Self-Supervised Multimodal Learning: A Survey

Self-supervised Learning: Generative or Contrastive

Self-Supervised Representation Learning for Videos by Segmenting Via Sampling Rate Order Prediction

Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey

Scaling and Benchmarking Self-Supervised Visual Representation Learning

Self-Supervised Representation Learning for Visual Anomaly Detection

Watching Too Much Television is Good: Self-Supervised Audio-Visual Representation Learning from Movies and TV Shows

Hierarchical Self-supervised Representation Learning for Movie Understanding

A Survey on Contrastive Self-Supervised Learning

Self-Supervised Representation Learning: Introduction, advances, and challenges

No More Shortcuts: Realizing the Potential of Temporal Self-Supervision

Self-Supervised Video Representation Learning with Motion-Contrastive Perception

Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data

Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

Self-supervised Learning: A Succinct Review