Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

Bozheng Li, Mushui Liu, Gaoang Wang, Yunlong Yu
2024-08-22
Abstract: In this paper, we propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR), which incorporates a sequential perceiver adapter into the pre-training framework to integrate both spatial information and sequential temporal dynamics into the feature embeddings. Unlike existing fine-tuning approaches, which capture temporal information by exploring the relationships among all the frames, our perceiver-based adapter recurrently captures the sequential dynamics along the timeline and can therefore perceive changes in frame order. To obtain discriminative representations for each class, we extend a textual corpus for each class derived from large language models (LLMs) and enrich the visual prototypes by integrating the contextual semantic information. In addition, we introduce an unbalanced optimal transport strategy for feature matching that mitigates the impact of class-unrelated features, thereby facilitating more effective decision-making. Experimental results on five FSAR datasets demonstrate that our method sets a new benchmark, beating the second-best competitors by large margins.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the insufficient modeling of temporal order in few-shot action recognition (FSAR). Specifically, existing FSAR methods often fail to capture how the order of frames affects the action category. For example, simply rearranging the frames of a video may significantly change the action it depicts, yet existing models have difficulty distinguishing these differences. To solve this problem, the authors propose a new Temporal Sequence-Aware Model (TSAM). Its sequential perceiver adapter captures the temporal order of video frames and integrates it into the feature representation. In addition, TSAM combines a text corpus enhancement module and an unbalanced optimal transport matching strategy to further improve performance.

### Specific problem description

1. **Lack of temporal order information**:
   - Existing methods usually treat all frames as equal inputs and ignore the temporal order between them. As a result, the model may be unable to distinguish actions that differ only in frame order.
2. **Lack of richness in feature representation**:
   - Visual information alone may not be sufficient to capture the semantics of a video, so text information is introduced to enhance the feature representation.
3. **Interference from background noise**:
   - During few-shot matching, background noise in redundant frames may interfere with the decision and reduce the accuracy of the model.

### Solutions

1. **Temporal Sequence-Aware Model (TSAM)**:
   - Introduce the sequential perceiver adapter, which recurrently captures the dynamics along the temporal dimension and perceives changes in frame order.
2. **Text corpus enhancement**:
   - Use text descriptions generated by large language models (LLMs) to enrich the visual prototypes of each category, making them more discriminative.
3. **Unbalanced optimal transport matching**:
   - Use an unbalanced optimal transport (UOT) strategy to reduce the interference of background noise and class-unrelated features, improving matching accuracy.

Through these improvements, TSAM achieves significant performance gains on five few-shot action recognition datasets, surpassing the second-best competitors by large margins.
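
The point about frame order can be illustrated with a toy comparison. The sketch below is not the paper's sequential perceiver adapter; it only contrasts an order-agnostic aggregator (mean pooling) with a simple recurrent update, to show why processing frames along the timeline makes the resulting feature depend on their order. The function names, the `decay` parameter, and the two-dimensional frame features are all assumptions made for illustration.

```python
import math


def mean_pool(frames):
    """Order-agnostic baseline: average the frame features over time."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def recurrent_pool(frames, decay=0.5):
    """Order-aware toy aggregator (hypothetical, not the paper's adapter).

    A latent state is updated one frame at a time, so each update
    depends on everything that came before it.
    """
    latent = [0.0] * len(frames[0])
    for f in frames:
        latent = [math.tanh(decay * h + x) for h, x in zip(latent, f)]
    return latent


def close(u, v, tol=1e-9):
    """Element-wise comparison of two feature vectors."""
    return all(abs(a - b) <= tol for a, b in zip(u, v))


# Three made-up frame features and the same clip played backwards.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
reversed_frames = frames[::-1]

# Mean pooling cannot tell the two clips apart.
print(close(mean_pool(frames), mean_pool(reversed_frames)))            # True

# The recurrent aggregator produces different features for each order.
print(close(recurrent_pool(frames), recurrent_pool(reversed_frames)))  # False
```

Any aggregator that is symmetric in its inputs behaves like `mean_pool` here, which is why a model that should distinguish a clip from its reversed or shuffled version needs some order-dependent component.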