Abstract: This paper introduces a new research problem, video domain generalization (video DG), in which most state-of-the-art action recognition networks degrade because they are never exposed to target domains with divergent distributions. While recent advances in video understanding focus on capturing the temporal relations of long-term video context, we observe that global temporal features generalize poorly in video DG settings. The reason is that videos from unseen domains may exhibit unexpected absence, misalignment, or scale transformation of the temporal relations, a phenomenon known as temporal domain shift. Video DG is therefore even more challenging than image DG, itself an under-explored problem, because the spatial and temporal domain shifts are entangled. This finding leads us to view the key to video DG as how to effectively learn local-relation features at different time scales, which are more generalizable, and how to exploit them together with global-relation features to maintain discriminability. This paper presents the Adversarial Pyramid Network (APN), which progressively captures local-relation, global-relation, and multilayer cross-relation features. The pyramid network not only improves feature transferability from the view of representation learning, but also, when integrated with an improved version of an image DG adversarial data augmentation method, enhances the diversity and quality of the new data points that bridge different domains. We construct four video DG benchmarks: UCF-HMDB, Something-Something, PKU-MMD, and NTU, in which the source and target domains are divided according to different datasets, different consequences of actions, or different camera views. The APN consistently outperforms previous action recognition models on all benchmarks.
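
To make the distinction between local-relation features at different time scales and global-relation features concrete, the following is a minimal sketch and not the APN architecture itself. It assumes that per-frame feature vectors are already available and uses plain averaging as a stand-in for the learned relation modules; the names `local_relation_features` and `temporal_pyramid` and the window sizes are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Sequence, Union

Vector = List[float]


def mean_vectors(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty sequence of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def local_relation_features(frames: Sequence[Sequence[float]], window: int) -> List[Vector]:
    """Pool every run of `window` consecutive frames (stride 1).

    Averaging is an assumption standing in for a learned relation module.
    """
    if not 1 <= window <= len(frames):
        raise ValueError("window must be between 1 and the number of frames")
    return [
        mean_vectors(frames[start:start + window])
        for start in range(len(frames) - window + 1)
    ]


def temporal_pyramid(
    frames: Sequence[Sequence[float]],
    windows: Sequence[int] = (2, 4),
) -> Dict[Union[int, str], List[Vector]]:
    """Collect local features per time scale plus one global feature."""
    pyramid: Dict[Union[int, str], List[Vector]] = {
        window: local_relation_features(frames, window) for window in windows
    }
    # The global-relation feature pools over the whole clip at once.
    pyramid["global"] = [mean_vectors(frames)]
    return pyramid


# Toy clip: 4 frames with 2-dimensional features (hypothetical data).
frames = [[float(t), 1.0] for t in range(4)]
pyramid = temporal_pyramid(frames, windows=(2, 4))
```

In this sketch, a short window yields several features that each cover only a small part of the clip, so a missing or shifted segment affects only some of them, whereas the single global feature depends on every frame. This is only meant to illustrate why features at short time scales may be less sensitive to temporal domain shift.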

AGPN: Action Granularity Pyramid Network for Video Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

A hybrid attention-guided ConvNeXt-GRU network for action recognition

Multipath Attention and Adaptive Gating Network for Video Action Recognition

Spatiotemporal Pyramid Network for Video Action Recognition

Deep multiple aggregation networks for action recognition

Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition

Temporal Pyramid Network for Action Recognition

Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition

Learning Hierarchical Video Representation for Action Recognition

Multi-stream P&U adaptive graph convolutional networks for skeleton-based action recognition

Adversarial Pyramid Network for Video Domain Generalization

Temporal adaptive feature pyramid network for action detection

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Hierarchical Feature Aggregation Networks for Video Action Recognition

Multi-scale residual network model combined with Global Average Pooling for action recognition

Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification

Spatial-Temporal Hypergraph Neural Network based on Attention Mechanism for Multi-view Data Action Recognition

Efficient spatio-temporal network for action recognition