Abstract: In this paper, we propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR), which incorporates a sequential perceiver adapter into the pre-training framework to integrate both spatial information and sequential temporal dynamics into the feature embeddings. Unlike existing fine-tuning approaches, which capture temporal information by exploring the relationships among all frames, our perceiver-based adapter recurrently captures the sequential dynamics along the timeline and can therefore perceive changes in frame order. To obtain discriminative representations for each class, we build an extended textual corpus for each class from large language models (LLMs) and enrich the visual prototypes by integrating this contextual semantic information. In addition, we introduce an unbalanced optimal transport strategy for feature matching that mitigates the impact of class-unrelated features, thereby facilitating more effective decision-making. Experimental results on five FSAR datasets demonstrate that our method sets a new benchmark, beating the second-best competitors by large margins.
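
To illustrate the abstract's point that recurrent processing along the timeline can perceive frame order, the following minimal sketch compares an order-agnostic aggregation (mean pooling) with a toy recurrent update on the same frames in forward and reversed order. It is not the paper's sequential perceiver adapter: the functions `mean_pool` and `recurrent_pool`, the random weights, and the feature sizes are all hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_pool(frames):
    """Order-agnostic aggregation: average the per-frame features."""
    return frames.mean(axis=0)


def recurrent_pool(frames, w_state, w_input):
    """Toy recurrence (an assumption, not TSAM's adapter).

    A hidden state is updated frame by frame, so the result depends
    on the order in which the frames arrive.
    """
    state = np.zeros(w_state.shape[0])
    for frame in frames:  # walk along the timeline
        state = np.tanh(w_state @ state + w_input @ frame)
    return state


rng = np.random.default_rng(0)
num_frames, dim = 8, 16  # hypothetical sizes
frames = rng.normal(size=(num_frames, dim))
w_state = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_input = rng.normal(size=(dim, dim)) / np.sqrt(dim)

reversed_frames = frames[::-1]  # same frames, opposite temporal order

# Largest per-dimension difference between forward and reversed clips.
mean_gap = np.abs(mean_pool(frames) - mean_pool(reversed_frames)).max()
recurrent_gap = np.abs(
    recurrent_pool(frames, w_state, w_input)
    - recurrent_pool(reversed_frames, w_state, w_input)
).max()

print("mean pooling gap:", mean_gap)
print("recurrent gap:", recurrent_gap)
```

Mean pooling yields the same embedding for both orderings up to floating-point rounding, whereas the recurrent state differs, which is the property an order-aware FSAR model needs.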

What problem does this paper attempt to address?

The paper addresses the insufficient modeling of temporal sequence information in few-shot action recognition (FSAR). When dealing with video data, existing FSAR methods often fail to capture how the order of frames affects the action category. For example, simply rearranging the frames of a video may change the action it depicts, yet existing models have difficulty distinguishing such differences. To solve this problem, the authors propose a new temporal sequence-aware model (TSAM). By introducing a sequential perceiver adapter, it better captures the temporal order of video frames and integrates that information into the feature representation. In addition, TSAM combines a text corpus enhancement module and an unbalanced optimal transport matching strategy to further improve performance.

### Specific problem description

1. **Lack of temporal order information**:
   - Existing methods usually treat all frames as equal inputs and ignore the temporal order between them. As a result, the model may be unable to distinguish category changes caused by different frame orders.
2. **Lack of richness in feature representation**:
   - Relying solely on visual information for feature extraction may not be sufficient to capture the semantics of a video. Textual information is therefore introduced to enhance the feature representation.
3. **Interference from background noise**:
   - During few-shot matching, background noise in redundant frames may interfere with the decision and reduce the accuracy of the model.

### Solutions

1. **Temporal sequence-aware model (TSAM)**:
   - Introduce the sequential perceiver adapter to recurrently capture dynamics along the temporal dimension and perceive changes in frame order.
2. **Text corpus enhancement**:
   - Use text descriptions generated by large language models (LLMs) to enhance the visual prototype of each category, making the prototypes more discriminative.
3. **Unbalanced optimal transport matching**:
   - Introduce an unbalanced optimal transport (UOT) strategy to reduce the interference of background noise and improve matching accuracy.

Through these improvements, TSAM achieves significant performance gains on five few-shot action recognition datasets, surpassing the second-best competitors.
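
The effect of unbalanced optimal transport on class-unrelated frames can be sketched as follows. This summary does not give the paper's exact UOT formulation, cost function, or hyperparameters, so the example assumes a common variant: entropic regularization with KL-relaxed marginals, solved by Sinkhorn-style scaling iterations. The function `unbalanced_sinkhorn`, the parameters `eps` and `tau`, and the toy features are hypothetical.

```python
import numpy as np


def unbalanced_sinkhorn(cost, a, b, eps=0.1, tau=0.5, num_iters=200):
    """Entropic unbalanced OT with KL-relaxed marginals (assumed variant).

    cost: (n, m) pairwise cost matrix
    a, b: source and target masses, shapes (n,) and (m,)
    eps:  entropic regularization strength
    tau:  marginal relaxation strength; a larger tau enforces the
          marginals more strictly (closer to balanced OT)
    """
    kernel = np.exp(-cost / eps)
    exponent = tau / (tau + eps)  # softens the usual Sinkhorn update
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(num_iters):
        u = (a / (kernel @ v)) ** exponent
        v = (b / (kernel.T @ u)) ** exponent
    return u[:, None] * kernel * v[None, :]


# Toy features: three support frames, and four query frames of which the
# last one is a "background" frame orthogonal to every support frame.
support = np.eye(4)[:3]  # shape (3, 4)
query = np.eye(4)        # shape (4, 4)

cost = 1.0 - query @ support.T  # cosine distance for unit-norm features
a = np.full(4, 1.0 / 4)         # uniform mass over query frames
b = np.full(3, 1.0 / 3)         # uniform mass over support frames

plan = unbalanced_sinkhorn(cost, a, b)
row_mass = plan.sum(axis=1)  # mass actually transported per query frame

print("transported mass per query frame:", row_mass)
```

Balanced optimal transport would force every query frame, including the background one, to ship its full mass of 0.25. With the relaxed marginals, the background frame transports noticeably less mass than the matched frames, so it contributes less to any matching score computed from the plan.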

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition

On the Importance of Spatial Relations for Few-shot Action Recognition

Semantic-aware Video Representation for Few-shot Action Recognition

Semantic-guided spatio-temporal attention for few-shot action recognition

Temporal Distinct Representation Learning for Action Recognition

MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition

Elastic Temporal Alignment for Few‐shot Action Recognition

MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

Motion-modulated Temporal Fragment Alignment Network for Few-Shot Action Recognition

An efficient framework for few-shot skeleton-based temporal action segmentation

TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition

Few-shot action recognition with implicit temporal alignment and pair similarity optimization

Exploring Frame Segmentation Networks for Temporal Action Localization

Two-Stream Temporal Feature Aggregation Based on Clustering for Few-Shot Action Recognition

Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

Alignment-guided Temporal Attention for Video Action Recognition

Few-Shot Video Classification via Temporal Alignment

Enhancing Few-Shot Action Recognition Using Skeleton Temporal Alignment and Adversarial Training