Attentive spatial-temporal contrastive learning for self-supervised video representation
Xingming Yang,Sixuan Xiong,Kewei Wu,Dongfeng Shan,Zhao Xie
DOI: https://doi.org/10.1016/j.imavis.2023.104765
IF: 3.86
2023-09-01
Image and Vision Computing
Abstract:Most existing self-supervised works learn video representation by using a single pretext task. A single pretext task, providing single supervision from unlabeled data, may neglect to describe the difference between spatial features and temporal features. The similar spatial features and temporal features may hinder distinguishing between two similar videos with different class labels. In this paper, we propose an attentive spatial–temporal contrastive learning network (ASTCNet), which learns self-attention spatial–temporal features by contrastive learning between multiple spatial and temporal pretext tasks. The spatial features are learned by multiple spatial pretext tasks, including spatial rotation, and spatial jigsaw. Each spatial feature is enhanced with spatial self-attention by learning the relation between patches. The temporal features are learned by multiple temporal pretext tasks, including temporal order, and temporal pace. Each temporal feature is enhanced with temporal self-attention by learning the relation between frames, and is enhanced by feeding the optical flow features into a motion module. To separate the spatial feature and the temporal feature learned in one video, we represent the video as different features for each pretext task, and design pretext task-based contrastive loss. The pretext task-based contrastive loss encourages the different pretext tasks to learn dissimilar features, and encourages the same pretext task to learn similar features. The pretext task-based contrastive loss can learn the discriminative features for each pretext task in one video. The experiments show that our method achieves state-of-the-art performance for self-supervised action recognition on the UCF101 dataset and the HMDB51 dataset.
computer science, artificial intelligence, theory & methods,engineering, electrical & electronic, software engineering,optics