Self-Supervised Learning for Videos: A Survey

Madeline C. Schiappa, Yogesh S. Rawat, Mubarak Shah
DOI: https://doi.org/10.1145/3577925
2023-07-20
Abstract: The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in both image and video domains. Different from the image domain, learning video representations is more challenging due to the temporal dimension, bringing in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domain. In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain. We summarize these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and the potential future directions in this area.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses a central obstacle for deep learning on video: labeled data is expensive and difficult to obtain. Deep learning has achieved remarkable success in many fields, but that success depends largely on the availability of large-scale labeled datasets. For video, producing high-quality labels is especially costly and labor-intensive. In addition, human-generated labels can lead models to learn biased representations that generalize poorly across domains and lack robustness. The paper therefore examines self-supervised learning (SSL) as an alternative that learns representations without relying on labeled data. SSL reduces the dependence on annotation and has shown promise in both the image and video domains. The paper focuses on self-supervised methods for video. Compared with images, video representation learning is more challenging because of the temporal dimension, which brings in motion and other environmental dynamics. These challenges also create opportunities for video-specific ideas that advance self-supervised learning in the video and multimodal domains. The paper reviews existing self-supervised methods and groups them into four categories by learning objective: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. It also introduces commonly used datasets and downstream evaluation tasks, and discusses the limitations of existing work and potential future research directions.