Collaboratively Self-supervised Video Representation Learning for Action Recognition

Jie Zhang, Zhifan Wan, Lanqing Hu, Stephen Lin, Shuzhe Wu, Shiguang Shan
2024-01-15
Abstract: Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition by jointly considering generative pose prediction and discriminative context matching as pretext tasks. Specifically, our CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generating branch. Among them, the first branch encodes dynamic motion features by using a conditional GAN to predict the human poses of future frames, and the second branch extracts static context features by pulling the representations of clips and compressed key frames from the same video together while pushing apart the pairs from different videos. The third branch is designed to recover the current video frames and predict the future ones, for the purpose of collaboratively improving dynamic motion features and static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to perform action recognition effectively without a large amount of labeled data. It proposes a framework named Collaboratively Self-supervised Video Representation (CSVR), which jointly learns dynamic motion features and static context features. The method uses human pose prediction and context matching as pretext tasks during pre-training, so that the learned video representation is more comprehensive and transfers better to downstream tasks. The main contributions of the paper are:
1. A novel framework, CSVR, that simultaneously learns dynamic motion features and static context features through three branches: the generative pose prediction branch, the discriminative context matching branch, and the collaborative video generation branch.
2. A generative pose prediction branch that uses a Conditional Generative Adversarial Network (CGAN) to predict future poses from the current pose sequence, and in doing so extracts dynamic motion features.
3. A collaborative video generation branch that recovers the current frames and predicts future ones, jointly optimizing the dynamic motion features and static context features to give downstream action understanding tasks a more comprehensive representation.
With these components, CSVR is reported to achieve state-of-the-art performance on the UCF101 and HMDB51 datasets.
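The context matching idea described above (pull clip and key-frame representations from the same video together, push apart pairs from different videos) can be illustrated with a standard InfoNCE-style contrastive loss. This is a minimal sketch, not the paper's actual objective: the text here does not state the exact loss, so the InfoNCE form, the function name `info_nce_loss`, the `temperature` value, and the use of plain NumPy arrays in place of encoder outputs are all assumptions made for illustration.

```python
import numpy as np


def info_nce_loss(clip_emb, key_emb, temperature=0.1):
    """Hypothetical InfoNCE-style loss for clip / key-frame matching.

    Row i of clip_emb and row i of key_emb are assumed to come from the
    same video (positive pair); every other row in the batch is treated
    as a negative. Both inputs have shape (batch, dim).
    """
    # L2-normalise so the dot product below is a cosine similarity.
    clip = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    key = key_emb / np.linalg.norm(key_emb, axis=1, keepdims=True)

    # Pairwise similarities: entry (i, j) compares clip i with key frame j.
    logits = clip @ key.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for clip i sits on the diagonal; average its negative
    # log-probability over the batch.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(8, 16))                 # stand-in clip features
    keys = clips + 0.01 * rng.normal(size=(8, 16))   # matching key-frame features
    print("aligned pairs   :", info_nce_loss(clips, keys))
    print("mismatched pairs:", info_nce_loss(clips, np.roll(keys, 1, axis=0)))
```

Running the script prints a lower loss for the aligned pairs than for the mismatched ones, which is the behaviour a context matching objective of this kind rewards. In a real pipeline the two inputs would be the outputs of the clip encoder and the key-frame encoder rather than random arrays.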