Collaboratively Self-supervised Video Representation Learning for Action Recognition

Jie Zhang, Zhifan Wan, Lanqing Hu, Stephen Lin, Shuzhe Wu, Shiguang Shan

2024-01-15

Abstract: Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition by jointly considering generative pose prediction and discriminative context matching as pretext tasks. Specifically, our CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generating branch. Among them, the first branch encodes dynamic motion features by utilizing a Conditional-GAN to predict the human poses of future frames, and the second branch extracts static context features by pulling the representations of clips and compressed key frames from the same video together while pushing apart the pairs from different videos. The third branch is designed to recover the current video frames and predict the future ones, for the purpose of collaboratively improving dynamic motion features and static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
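
The context matching idea described in the abstract (pull a clip and a key frame from the same video together, push pairs from different videos apart) is commonly realized with an InfoNCE-style contrastive loss. The following is a minimal NumPy sketch under that assumption; the function names, the temperature value, and the use of plain feature arrays in place of real encoder outputs are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def context_matching_loss(clip_feats, keyframe_feats, temperature=0.1):
    """InfoNCE-style loss over a batch of (clip, key frame) feature pairs.

    Assumption: row i of both arrays comes from video i, so the diagonal of
    the similarity matrix holds the positive pairs and every off-diagonal
    entry is a negative pair drawn from a different video.
    """
    z_clip = l2_normalize(np.asarray(clip_feats, dtype=np.float64))
    z_key = l2_normalize(np.asarray(keyframe_feats, dtype=np.float64))
    logits = z_clip @ z_key.T / temperature               # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability assigned to the matching key frame.
    return float(-np.mean(np.diag(log_prob)))


# Toy check with random stand-in features (not real video embeddings).
rng = np.random.default_rng(0)
clips = rng.normal(size=(4, 16))
matched = context_matching_loss(clips, clips + 0.01 * rng.normal(size=(4, 16)))
shuffled = context_matching_loss(clips, np.roll(clips, 1, axis=0))
print(matched < shuffled)
```

In this toy setup the loss is low when each clip is paired with a near-identical key-frame feature and high when the pairing is shifted by one video, which is the behavior a contrastive objective of this kind is meant to reward.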

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to perform action recognition effectively without a large amount of labeled data. It proposes a framework named Collaboratively Self-supervised Video Representation (CSVR), which aims to improve action recognition performance by jointly learning dynamic motion features and static context features. The method uses human pose prediction and context matching as pretext tasks to produce more comprehensive video representations, and thereby better performance on downstream tasks. The main contributions of the paper include:

1. Proposing a novel framework, CSVR, which simultaneously learns dynamic motion features and static context features through three branches: the generative pose prediction branch, the discriminative context matching branch, and the collaborative video generation branch.
2. Designing a generative pose prediction branch, which predicts future poses from the current pose sequence through a Conditional Generative Adversarial Network (CGAN), effectively extracting dynamic motion features.
3. Jointly optimizing dynamic motion features and static context features through the collaborative video generation branch, thus providing a more comprehensive representation for downstream action understanding tasks.

With these contributions, CSVR achieves state-of-the-art performance on the UCF101 and HMDB51 datasets, demonstrating its effectiveness for action recognition.
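
For orientation, the standard conditional GAN objective that a pose prediction branch of this kind would typically build on can be written as follows. The symbol assignment is an assumption made here for illustration: $c$ stands for the conditioning information derived from the observed frames, $p$ for the ground-truth future poses, and $z$ for a noise input. The exact loss used in the paper is not given in this summary and may differ.

```latex
\min_{G}\,\max_{D}\;
  \mathbb{E}_{(c,\,p)}\big[\log D(p \mid c)\big]
  + \mathbb{E}_{c,\,z}\big[\log\big(1 - D(G(z \mid c) \mid c)\big)\big]
```

Here the generator $G$ proposes future poses given $c$, and the discriminator $D$ judges whether a pose sequence is plausible for that condition, so the motion features feeding $G$ are pushed to carry enough dynamics to make the prediction convincing.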

Knowledge-guided Pre-Training and Fine-Tuning: Video Representation Learning for Action Recognition

Self-supervised pretext task collaborative multi-view contrastive learning for video action recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Online Robust Action Recognition Based on a Hierarchical Model

Joint Action Recognition And Pose Estimation From Video

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Collaborative Spatio-temporal Feature Learning for Video Action Recognition

Learning Hierarchical Video Representation for Action Recognition

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Contrast-reconstruction Representation Learning for Self-supervised Skeleton-based Action Recognition

An Approach to Pose-Based Action Recognition

Learning Comprehensive Motion Representation for Action Recognition

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Self-organizing neural integration of pose-motion features for human action recognition

Multi-Task Learning of Generalizable Representations for Video Action Recognition

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition

Hierarchically Learned View-Invariant Representations for Cross-View Action Recognition

Unsupervised Learning of View-invariant Action Representations