Abstract: Few-shot action recognition is a challenging problem that aims to learn a model capable of adapting to recognize new categories from only a few labeled videos. Recently, some works have used attention mechanisms to focus on relevant regions and obtain discriminative representations. Despite significant progress, these methods still fall short of strong performance because labeled examples are insufficient and supplementary information is scarce. In this paper, we propose a novel Semantic-guided Spatio-temporal Attention (SGSTA) approach for few-shot action recognition. The main idea of SGSTA is to exploit the semantic information contained in the text embeddings of labels to guide attention, so that it captures the rich spatio-temporal context in videos more accurately when visual content is insufficient. Specifically, SGSTA comprises two essential components: a visual-text alignment module and a semantic-guided spatio-temporal attention module. The former aligns visual features and text embeddings to eliminate the semantic gap between them. The latter is further divided into spatial attention and temporal attention. First, semantic-guided spatial attention is applied to the frame feature map to focus on semantically relevant spatial regions. Then, semantic-guided temporal attention encodes the semantically enhanced temporal context with a temporal Transformer. Finally, the resulting spatio-temporal contextual representation is used to learn relationship matching between support and query sequences. In this way, SGSTA can fully utilize the rich semantic priors in label embeddings to improve class-specific discriminability and achieve accurate few-shot recognition. Comprehensive experiments on four challenging benchmarks demonstrate that the proposed SGSTA is effective and achieves competitive performance against existing state-of-the-art methods under various settings.
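The semantic-guided spatial attention step can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the cosine-similarity scoring, the `temperature` parameter, and the toy values are all assumptions made for illustration. It assumes the label's text embedding has already been aligned to the visual feature space, scores each spatial region of a frame against that embedding, and pools the regions with the resulting softmax weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def semantic_guided_spatial_attention(regions, text_embedding, temperature=0.1):
    """Hypothetical sketch of text-guided spatial pooling for one frame.

    regions: list of D-dim region features (a flattened H*W feature map).
    text_embedding: D-dim label embedding, assumed already aligned
        to the visual feature space.
    temperature: assumed scaling factor that sharpens the attention.
    Returns the attended frame feature and the attention weights.
    """
    scores = [cosine(r, text_embedding) / temperature for r in regions]
    weights = softmax(scores)
    dim = len(text_embedding)
    pooled = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return pooled, weights


# Toy example: three regions, the first matches the label direction best.
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_embedding = [1.0, 0.0]
frame_feature, weights = semantic_guided_spatial_attention(regions, text_embedding)
```

In a full pipeline, the per-frame features produced this way would be passed as a sequence to a temporal encoder such as a Transformer before support and query videos are matched; that stage is omitted here.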

VSA: Adaptive Visual and Semantic Guided Attention on Few-Shot Learning

Semantic-Aligned Attention with Refining Feature Embedding for Few-Shot Image Classification

Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition

Multi-Attention Based Visual-Semantic Interaction for Few-Shot Learning

Selectively Augmented Attention Network for Few-Shot Image Classification

Channel-spatial attention network for fewshot classification

SgVA-CLIP: Semantic-Guided Visual Adapting of Vision-Language Models for Few-Shot Image Classification

Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning

Semantic-guided spatio-temporal attention for few-shot action recognition

FewVS: A Vision-Semantics Integration Framework for Few-Shot Image Classification

Self-Enhanced Mixed Attention Network for Three-Modal Images Few-Shot Semantic Segmentation

Semantic-Based Few-Shot Learning by Interactive Psychometric Testing

SpatialFormer: Semantic and Target Aware Attentions for Few-Shot Learning

Simple Semantic-Aided Few-Shot Learning

A Self-Distillation Embedded Supervised Affinity Attention Model for Few-Shot Segmentation

Adaptive Cross-Modal Few-Shot Learning

Learn to Pay Attention Via Switchable Attention for Image Recognition

Attention-Based Multi-Context Guiding for Few-Shot Semantic Segmentation

Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment

Few-Shot Learning Based on Deep Learning for Image Classification

Boosting Few-Shot Segmentation via Instance-Aware Data Augmentation and Local Consensus Guided Cross Attention