Abstract: Transferring vision-language knowledge from pretrained multimodal foundation models to various downstream tasks is a promising direction. However, most current few-shot action recognition methods are still limited to a single visual modality input due to the high cost of annotating additional textual descriptions. In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text. To be specific, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate associated captions for input videos. Then, we apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, a visual-text aggregation module based on Transformer is further designed to incorporate cross-modal spatio-temporal complementary information for reliable few-shot matching. In this way, CapFSAR can benefit from powerful multimodal knowledge of pretrained foundation models, yielding more comprehensive classification in the low-shot regime. Extensive experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods and achieves state-of-the-art performance. The code will be made publicly available.

What problem does this paper attempt to address?

The problem this paper addresses is **how to use multimodal knowledge to improve classification accuracy in few-shot action recognition**. Most existing few-shot action recognition methods are limited to a single visual-modality input because annotating additional text descriptions is costly. The result is a lack of information, which matters most when labeled data is scarce.

### Core problems of the paper

1. **Information scarcity**: Under few-shot conditions, information from a single visual modality is not sufficient for reliable classification.
2. **Utilization of multimodal knowledge**: How to use the knowledge of pretrained multimodal models to improve classification performance without manually labeling text.

### Solutions

To address these problems, the paper proposes a framework named **CapFSAR**. Its main innovations are:

- **Automatically generated text descriptions**: A pretrained captioning foundation model (BLIP) generates captions for the input videos, eliminating the need for manual text labeling.
- **Cross-modal aggregation module**: A Transformer-based visual-text aggregation module fuses visual and text features, captures spatio-temporal complementary information, and further enhances the model's temporal perception ability.

### Method overview

1. **Visual encoder**: Extracts visual features from the input video.
2. **Caption decoder**: Automatically generates text descriptions from the visual features.
3. **Text encoder**: Encodes the generated text descriptions to obtain text embeddings.
4. **Visual-text aggregation module**: Fuses visual and text features, performs cross-modal interaction, and enhances the video representations.
5. **Few-shot metric**: Applies a temporal metric (such as OTAM) to compute the similarity between the support set and the query set and complete the classification task.

### Experimental results

The paper reports extensive experiments on multiple standard few-shot benchmark datasets. The results show that CapFSAR performs favorably against existing methods and achieves state-of-the-art performance.

### Summary

CapFSAR addresses the information scarcity problem in few-shot action recognition by automatically generating text descriptions and drawing on the knowledge of pretrained multimodal models, which improves classification accuracy. The method removes the need for manual text annotation while still exploiting multimodal information.
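Steps 4 and 5 of the method overview can be illustrated with a small, dependency-free sketch. It is not the paper's implementation, and several simplifications are assumptions made here: a single parameter-free attention step stands in for the Transformer-based aggregation module, plain dynamic time warping (DTW) over cosine distances stands in for OTAM, and hand-written 2-D vectors stand in for the BLIP frame features and caption embeddings. The function names `fuse_with_caption`, `dtw_distance`, and `classify` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-8)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_with_caption(frames, caption):
    """Assumed stand-in for the aggregation module: each frame attends over
    all frame tokens plus one caption token, with a residual connection.
    Frame and caption vectors must share the same dimension."""
    tokens = frames + [caption]
    scale = math.sqrt(len(caption))
    fused = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        attended = [
            sum(w * tok[j] for w, tok in zip(weights, tokens))
            for j in range(len(query))
        ]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


def dtw_distance(query, support):
    """Plain DTW alignment cost between two frame sequences
    (a simplification; OTAM itself is not reproduced here)."""
    n, m = len(query), len(support)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cosine_distance(query[i - 1], support[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(query, support_set):
    """Return the label whose support videos have the lowest mean DTW cost.
    support_set maps a label to a list of fused videos (one per shot)."""
    def mean_distance(label):
        videos = support_set[label]
        return sum(dtw_distance(query, v) for v in videos) / len(videos)

    return min(support_set, key=mean_distance)


# Toy 2-way 1-shot episode with made-up features.
support = {
    "run": [fuse_with_caption([[1.0, 0.0]] * 3, [1.0, 0.0])],
    "jump": [fuse_with_caption([[0.0, 1.0]] * 3, [0.0, 1.0])],
}
query_video = fuse_with_caption([[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]], [1.0, 0.0])
print(classify(query_video, support))
```

In a real pipeline the toy vectors would be replaced by per-frame features and caption embeddings from the pretrained encoders, and the attention step would carry learned projections. The sketch only shows how a caption token can influence every frame representation before temporal matching.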

Few-shot Action Recognition with Captioning Foundation Models
