Abstract: Few-shot action recognition is a challenging problem that aims to learn a model capable of recognizing new categories from only a few labeled videos. Recently, several works have used attention mechanisms to focus on relevant regions and obtain discriminative representations. Despite significant progress, the performance of these methods remains limited by the small number of examples and the lack of supplementary information. In this paper, we propose a novel Semantic-guided Spatio-temporal Attention (SGSTA) approach for few-shot action recognition. The main idea of SGSTA is to exploit the semantic information contained in the text embeddings of labels to guide attention, so that it captures the rich spatio-temporal context in videos more accurately when visual content is insufficient. Specifically, SGSTA comprises two essential components: a visual-text alignment module and a semantic-guided spatio-temporal attention module. The former aligns visual features and text embeddings to eliminate the semantic gap between them. The latter is further divided into spatial attention and temporal attention. First, semantic-guided spatial attention is applied to the frame feature map to focus on semantically relevant spatial regions. Then, semantic-guided temporal attention encodes the semantically enhanced temporal context with a temporal Transformer. Finally, the resulting spatio-temporal contextual representation is used to learn relationship matching between support and query sequences. In this way, SGSTA can fully utilize the rich semantic priors in label embeddings to improve class-specific discriminability and achieve accurate few-shot recognition. Comprehensive experiments on four challenging benchmarks demonstrate that the proposed SGSTA is effective and achieves performance competitive with existing state-of-the-art methods under various settings.
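The semantic-guided spatial attention step described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this example assumes a scaled dot-product score between each spatial region of a frame and the label's text embedding, with both already projected into a shared space by some alignment step. The function name `semantic_guided_spatial_pool` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def semantic_guided_spatial_pool(frame_map, text_emb):
    """Pool one frame's region features, weighted by similarity to the label text.

    frame_map: list of N region vectors, each of length D (one frame's feature map).
    text_emb:  label embedding of length D, ASSUMED to be already aligned
               with the visual feature space.
    Returns (pooled feature of length D, attention weights of length N).
    """
    dim = len(text_emb)
    scale = math.sqrt(dim)  # assumption: standard scaled dot-product scoring
    scores = [dot(region, text_emb) / scale for region in frame_map]
    weights = softmax(scores)
    pooled = [
        sum(w * region[i] for w, region in zip(weights, frame_map))
        for i in range(dim)
    ]
    return pooled, weights


# Toy frame with three regions in a 2-D feature space (illustrative values only).
frame_map = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_emb = [2.0, 0.0]  # hypothetical label embedding pointing along the first axis
pooled, weights = semantic_guided_spatial_pool(frame_map, text_emb)
```

In this toy case the region most similar to the text embedding receives the largest weight, so the pooled frame feature leans toward it. Under the same assumptions, the per-frame pooled features would then be stacked over time and passed to a temporal encoder, which is omitted here.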

Spatio-Temporal Self-supervision for Few-Shot Action Recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Semantic-guided spatio-temporal attention for few-shot action recognition

On the Importance of Spatial Relations for Few-shot Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Few-shot Action Recognition via Improved Attention with Self-supervision

Cross-modal Guides Spatio-Temporal Enrichment Network for Few-Shot Action Recognition

Task-Agnostic Self-Distillation for Few-Shot Action Recognition

Elastic Temporal Alignment for Few‐shot Action Recognition

Few-shot action recognition with implicit temporal alignment and pair similarity optimization

Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition

Convolutional Self-attention Guided Graph Neural Network for Few-Shot Action Recognition

SSTA-Net: Self-supervised Spatio-Temporal Attention Network for Action Recognition

Cross-domain few-shot action recognition with unlabeled videos

Trajectory-aligned Space-time Tokens for Few-shot Action Recognition

Learning Causal Domain-Invariant Temporal Dynamics for Few-Shot Action Recognition

Short-Term Action Learning for Video Action Recognition

Two-Stream Temporal Feature Aggregation Based on Clustering for Few-Shot Action Recognition

D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition