Few-shot Action Recognition with Captioning Foundation Models

Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang
2023-10-16
Abstract: Transferring vision-language knowledge from pretrained multimodal foundation models to various downstream tasks is a promising direction. However, most current few-shot action recognition methods are still limited to a single visual modality input due to the high cost of annotating additional textual descriptions. In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text. To be specific, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate associated captions for input videos. Then, we apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, a visual-text aggregation module based on Transformer is further designed to incorporate cross-modal spatio-temporal complementary information for reliable few-shot matching. In this way, CapFSAR can benefit from powerful multimodal knowledge of pretrained foundation models, yielding more comprehensive classification in the low-shot regime. Extensive experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods and achieves state-of-the-art performance. The code will be made publicly available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is **how to use multimodal knowledge effectively to improve classification accuracy in few-shot action recognition**. Most existing few-shot action recognition methods take only a single visual modality as input, because annotating additional text descriptions is costly. With only a handful of labeled examples per class, the visual signal alone leaves the model short of information.

### Core problems of the paper

1. **Information scarcity**: Under few-shot conditions, information from a single visual modality is not sufficient for reliable classification.
2. **Utilization of multimodal knowledge**: How to use the knowledge of pretrained multimodal models to improve classification performance without manually annotating text.

### Solutions

To address these problems, the paper proposes a framework named **CapFSAR**. Its main innovations are:

- **Automatically generated text descriptions**: A pretrained captioning foundation model (BLIP) generates captions for the input videos, which removes the need for manual text annotation.
- **Cross-modal aggregation module**: A Transformer-based visual-text aggregation module fuses visual and text features, captures complementary spatio-temporal information, and strengthens the model's temporal perception.

### Method overview

1. **Visual encoder**: Extract visual features from the input video.
2. **Caption decoder**: Automatically generate text descriptions from the visual features.
3. **Text encoder**: Encode the generated captions to obtain text embeddings.
4. **Visual-text aggregation module**: Fuse visual and text features through cross-modal interaction to enhance the video representation.
5. **Few-shot metric**: Apply a temporal alignment metric (such as OTAM) to compute the similarity between support and query videos and classify the query.

### Experimental results

The paper reports extensive experiments on multiple standard few-shot benchmarks, in which CapFSAR performs favorably against existing methods and achieves state-of-the-art performance.

### Summary

CapFSAR tackles the information scarcity problem in few-shot action recognition by generating captions automatically and drawing on the knowledge of pretrained multimodal models, which improves classification accuracy. The approach avoids manual text annotation while still using both visual and textual information.
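
The last two steps of the method overview (visual-text aggregation, then temporal matching) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes frame features and caption token embeddings are already available as arrays of the same width, replaces the learned Transformer with one parameter-free self-attention pass, and uses plain dynamic time warping as a simplified stand-in for OTAM. The names `aggregate`, `dtw_distance`, and `classify` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def aggregate(frame_feats, text_feats):
    """Parameter-free stand-in for the visual-text aggregation module.

    frame_feats: (T, D) per-frame visual features.
    text_feats:  (L, D) caption token embeddings.
    Returns (T, D) frame features after one self-attention pass over the
    concatenated [text; frames] sequence, with a residual connection.
    A real module would use learned projections; this one has none.
    """
    tokens = np.concatenate([text_feats, frame_feats], axis=0)  # (L+T, D)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)  # row-wise softmax
    fused = tokens + attn @ tokens  # residual connection
    return fused[text_feats.shape[0]:]  # keep only the frame positions


def dtw_distance(query, support):
    """Cumulative cosine distance along the best monotonic frame alignment.

    Plain dynamic time warping, used here as an assumption in place of OTAM.
    """
    q, s = l2_normalize(query), l2_normalize(support)
    cost = 1.0 - q @ s.T  # (Tq, Ts) frame-to-frame cosine distance
    acc = np.full((cost.shape[0] + 1, cost.shape[1] + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, acc.shape[0]):
        for j in range(1, acc.shape[1]):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[-1, -1]


def classify(query, supports):
    """Return the label of the support video closest to the query."""
    return min(supports, key=lambda label: dtw_distance(query, supports[label]))


# Toy 2-way 1-shot episode with random features (placeholders, not real data).
rng = np.random.default_rng(0)
T, L, D = 8, 5, 16
supports = {
    label: aggregate(rng.normal(size=(T, D)), rng.normal(size=(L, D)))
    for label in ("jump", "run")
}
query = supports["jump"] + 0.01 * rng.normal(size=(T, D))
print(classify(query, supports))
```

The toy query is a lightly perturbed copy of one support video, so it only checks that the matching step behaves sensibly. In the framework described above, the features would come from the BLIP visual encoder and the text encoder applied to generated captions.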