ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos

Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan

2024-04-09

Abstract: Human action or activity recognition in videos is a fundamental task in computer vision with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabeled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach where 3D Convolutional Neural Networks (3D CNNs) and video transformers (VIT) are utilised to capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while VIT excels at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each of these architectures. Experimental results on standard action recognition datasets demonstrate that our approach performs better than the existing methods, achieving state-of-the-art performance with only a fraction of labeled data. The official website of this work is available at:

Computer Vision and Pattern Recognition, Artificial Intelligence, Human-Computer Interaction, Machine Learning, Multimedia

What problem does this paper attempt to address?

The paper proposes ActNetFormer, a method for semi-supervised action recognition in videos. Traditional supervised methods require large amounts of annotated data; ActNetFormer instead combines cross-architecture pseudo-labeling with contrastive learning so that a small amount of labeled data and a large amount of unlabeled data can be used together. It pairs 3D convolutional neural networks (3D CNNs) with video transformers (VIT) to capture the local and global contextual information of actions, respectively: 3D CNNs are good at capturing spatial features and local dependencies in the temporal domain, while VIT excels at capturing long-range dependencies across frames. The ActNetFormer framework integrates these two complementary architectures to learn a more complete representation of actions. Additionally, the paper introduces cross-architecture contrastive learning to enhance the alignment and mutual information of representations extracted from the different architectures. Experimental results show that ActNetFormer outperforms existing methods on standard action recognition datasets, achieving state-of-the-art results even with limited labeled data. This indicates that ActNetFormer makes effective use of unlabeled videos in semi-supervised video action recognition.
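
The two ingredients named above, cross-architecture pseudo-labeling and cross-architecture contrastive learning, can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The confidence threshold, the zero-loss masking rule, the InfoNCE-style loss, the temperature, and every function name below are assumptions made for illustration; the summary does not specify any of them. Plain lists stand in for the logits and feature vectors that a 3D CNN and a video transformer would produce for the same unlabeled clips.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(probs, threshold):
    """Return the argmax class if its probability clears the threshold, else None."""
    conf = max(probs)
    return probs.index(conf) if conf >= threshold else None


def cross_pseudo_label_loss(cnn_logits, vit_logits, threshold=0.8):
    """Cross-entropy of each model against the other model's confident pseudo-label.

    Returns (loss_for_cnn, loss_for_vit) for one unlabeled clip. A loss is 0.0
    when the teaching model is below the threshold (assumed masking rule).
    """
    p_cnn, p_vit = softmax(cnn_logits), softmax(vit_logits)
    label_from_vit = pseudo_label(p_vit, threshold)  # supervises the CNN
    label_from_cnn = pseudo_label(p_cnn, threshold)  # supervises the transformer
    loss_cnn = -math.log(p_cnn[label_from_vit]) if label_from_vit is not None else 0.0
    loss_vit = -math.log(p_vit[label_from_cnn]) if label_from_cnn is not None else 0.0
    return loss_cnn, loss_vit


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_arch_contrastive_loss(cnn_feats, vit_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed choice): the CNN feature of clip i should be
    closer to the transformer feature of clip i than to those of other clips."""
    losses = []
    for i, anchor in enumerate(cnn_feats):
        sims = [cosine(anchor, f) / temperature for f in vit_feats]
        losses.append(-math.log(softmax(sims)[i]))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # The transformer is confident about class 0; the CNN is not confident.
    loss_cnn, loss_vit = cross_pseudo_label_loss([0.2, 0.1, 0.3], [4.0, 0.0, 0.0])
    print("pseudo-label losses (cnn, vit):", loss_cnn, loss_vit)

    # Two clips whose features agree across architectures.
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print("contrastive loss:", cross_arch_contrastive_loss(feats, feats))
```

In this toy setup only the CNN receives a pseudo-label loss, because only the transformer's prediction clears the assumed threshold. The contrastive term is small when the two architectures embed the same clip similarly and grows when the pairing is broken.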

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

SITAR: Semi-supervised Image Transformer for Action Recognition

SVFormer: Semi-supervised Video Transformer for Action Recognition

Human Action Recognition From Digital Videos Based on Deep Learning

Convolutional transformer network for fine-grained action recognition

Human Action Recognition Using Deep Learning Methods

ActionFormer: Localizing Moments of Actions with Transformers

A unified framework for unsupervised action learning via global-to-local motion transformer

Efficient Action Recognition with Introducing R(2+1)D Convolution to Improved Transformer

Modeling transformer architecture with attention layer for human activity recognition

Human-Centric Transformer for Domain Adaptive Action Recognition

Dense Semantics-Assisted Networks For Video Action Recognition

EventTransAct: A video transformer-based framework for Event-camera based action recognition

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

Cross-Modal Learning with 3D Deformable Attention for Action Recognition

RNNs, CNNs and Transformers in Human Action Recognition: A Survey and a Hybrid Model

View-Robust Neural Networks for Unseen Human Action Recognition in Videos

An Effective-Efficient Approach for Dense Multi-Label Action Detection

Deep set conditioned latent representations for action recognition

Action recognition method based on a novel keyframe extraction method and enhanced 3D convolutional neural network