Abstract: Action recognition is a popular research topic in computer vision and machine learning. Although many action recognition methods have been proposed, few researchers have focused on cross-domain few-shot action recognition, which is often required in real-world security surveillance. The task is challenging because action recognition, domain adaptation, and few-shot learning must be solved simultaneously. To address this, we develop a novel end-to-end pairwise attentive adversarial spatiotemporal network (PASTN) for cross-domain few-shot action recognition, in which spatiotemporal information acquisition, few-shot learning, and video domain adaptation are realised in a unified framework. Specifically, ResNet-50 is selected as the backbone of the PASTN, and a 3D convolution block is embedded in the top layer of this 2D CNN to capture spatiotemporal representations. Moreover, a novel attentive adversarial network architecture is designed to align the spatiotemporal dynamics of actions with higher domain discrepancies. In addition, a pairwise margin discrimination loss is designed for the pairwise network architecture to improve the discriminability of the learned domain-invariant spatiotemporal features. Extensive experiments on three public cross-domain action recognition benchmarks, SDAI Action I, SDAI Action II, and UCF50-OlympicSport, demonstrate that the proposed PASTN significantly outperforms state-of-the-art cross-domain action recognition methods in terms of both accuracy and computational time. Even when only two labelled training samples per category are available in the office1 scenario of the SDAI Action I dataset, the accuracy of the PASTN is 6.1%, 10.9%, 16.8%, and 14% higher than that of the TA3N, TemporalPooling, I3D, and P3D methods, respectively.
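To make the idea of a pairwise margin objective concrete, the following is a minimal sketch of a generic contrastive margin loss on one pair of feature vectors. It is an assumption for illustration only: the abstract does not give the exact form of the PASTN pairwise margin discrimination loss, and the function name, the Euclidean distance, and the `margin` parameter are all hypothetical choices rather than the authors' formulation.

```python
import math


def pairwise_margin_loss(feat_a, feat_b, same_class, margin=1.0):
    """Generic contrastive margin loss for one pair of feature vectors.

    Hypothetical illustration, NOT the loss defined in the PASTN paper.
    Same-class pairs are pulled together; different-class pairs are
    pushed apart until their distance reaches `margin`.
    """
    # Euclidean distance between the two feature vectors.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
    if same_class:
        # Penalise any distance between samples of the same action.
        return dist ** 2
    # Penalise different actions only if they are closer than the margin.
    return max(0.0, margin - dist) ** 2


# Toy features standing in for a source-domain and a target-domain clip.
source_feat = [0.0, 0.0]
target_feat = [3.0, 4.0]  # Euclidean distance to source_feat is 5.0

print(pairwise_margin_loss(source_feat, target_feat, same_class=True))
print(pairwise_margin_loss(source_feat, target_feat, same_class=False, margin=6.0))
```

In a pairwise architecture of the kind described, a term like this would typically be evaluated on features of a source-domain clip and a target-domain clip, so that clips of the same action are drawn together across domains while clips of different actions stay separated.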

SCaTNet: A Novel Self-supervised Contrastive Framework with Spatial-Channel Attention and Temporal Transformer for Few-Shot Action Recognition.

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

On the Importance of Spatial Relations for Few-shot Action Recognition

SSTA-Net: Self-supervised Spatio-Temporal Attention Network for Action Recognition.

Semantic-guided spatio-temporal attention for few-shot action recognition

Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition

A Pairwise Attentive Adversarial Spatiotemporal Network for Cross-Domain Few-Shot Action Recognition-R2.

STCA: an action recognition network with spatio-temporal convolution and attention

Few-shot Action Recognition via Improved Attention with Self-supervision

Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition

A Novel Action Saliency and Context-Aware Network for Weakly-Supervised Temporal Action Localization

Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

Task-Aware Dual-Representation Network for Few-Shot Action Recognition

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Cross-modal Guides Spatio-Temporal Enrichment Network for Few-Shot Action Recognition

Temporal Transformer Networks with Self-Supervision for Action Recognition.

Short-Term Action Learning for Video Action Recognition