Abstract: Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with Parameter-Efficient Fine-Tuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. First, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Second, we design a multi-task decoder with a rich set of supervisory signals to satisfy the need for both strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.

What problem does this paper attempt to address?

This paper addresses a central tension in video action recognition: **the trade-off between strong supervised performance and strong generalization ability**. Specifically:

1. **Strong supervised performance**: Existing methods often sacrifice generalization when pursuing supervised accuracy. When a large-scale vision-language pretrained model such as CLIP is applied to video action recognition, fully fine-tuning the network can improve task-specific performance, but it is computationally expensive and may erode CLIP's original generalization ability.
2. **Strong generalization ability**: To preserve CLIP's generalization ability, some studies adopt parameter-efficient fine-tuning (PEFT): the original CLIP parameters are frozen and adapters are introduced, which reduces the number of learnable parameters. However, freezing the backbone in a multimodal framework leads to a decline in supervised performance, and these methods can perform poorly in zero-shot scenarios.

To resolve this, the paper proposes a new multimodal, multi-task adaptation framework, **M2-CLIP**, which aims to achieve **strong supervised performance and strong generalization ability** simultaneously. The main contributions of M2-CLIP are:

1. **Multimodal adapter design**:
   - **TED-Adapter**: For the video encoder, a novel Temporal Enhancement and Difference-modeling Adapter (TED-Adapter) captures global temporal enhancement and local temporal difference features simultaneously, strengthening the temporal representation ability of the video encoder.
   - **Text adapter**: An adapter is introduced for the text encoder to improve the learning of the semantic information in action labels, making the label representations more distinguishable and adjustable.
2. **Multi-task decoder**: A decoder with four components improves the joint representation ability of the whole multimodal framework through multiple task constraints:
   - **Contrastive learning head**: Aligns video and text representations to ensure semantic consistency between them.
   - **Cross-modal classification head**: Emphasizes the discriminative ability of cross-modal features, helping the model separate videos of different categories.
   - **Cross-modal masked language modeling head**: Encourages visual features to focus on verbs, improving action recognition accuracy.
   - **Visual classification head**: Sharpens the discrimination of video features across categories, further improving supervised performance.

With these designs, M2-CLIP performs well in supervised learning while retaining strong generalization in zero-shot scenarios. The experimental results support the effectiveness of the method on multiple datasets.

### Summary

By introducing multimodal adapters and a multi-task decoder, M2-CLIP addresses the tension between strong supervised performance and strong generalization ability in video action recognition and strikes a balance between the two.
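The TED-Adapter idea described above (a global temporal enhancement combined with a local temporal difference, inside a lightweight adapter) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the bottleneck projections `w_down` and `w_up`, the use of a temporal mean as the "global" signal, and the ReLU fusion are all assumptions chosen for illustration.

```python
import numpy as np


def temporal_difference(h):
    """Local motion cue: h[t] - h[t-1], with zeros for the first frame."""
    diff = np.zeros_like(h)
    diff[1:] = h[1:] - h[:-1]
    return diff


def ted_adapter_sketch(x, w_down, w_up):
    """Hypothetical adapter over per-frame features.

    x:      (T, D) features for T frames (e.g. from a frozen encoder layer)
    w_down: (D, r) bottleneck down-projection (assumed, r << D)
    w_up:   (r, D) up-projection back to the feature width (assumed)
    """
    h = x @ w_down                                   # (T, r) bottleneck features
    # Assumed "global temporal enhancement": add the mean over all frames.
    global_context = h.mean(axis=0, keepdims=True)   # (1, r)
    enhanced = h + global_context
    # Fuse global and local cues, then apply a ReLU non-linearity.
    fused = np.maximum(enhanced + temporal_difference(h), 0.0)
    # Residual connection keeps the frozen backbone's features intact.
    return x + fused @ w_up
```

With `w_up` initialized to zeros, the sketch returns its input unchanged, so training would start from the frozen backbone's behavior. That is a common choice for adapters in general and is not something the summary attributes to M2-CLIP.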
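To make the multi-task objective more concrete, the sketch below combines two of the four heads described above: a CLIP-style video-to-text alignment loss over class-label embeddings and a plain classification loss, summed with per-task weights. The function names, the temperature of 0.07, and the weighting scheme are illustrative assumptions. The summary does not specify how the paper weights or formulates its losses, and the masked language modeling head is omitted here.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy; logits: (B, C), labels: (B,) integer ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def video_text_alignment_loss(video_emb, class_text_emb, labels, temperature=0.07):
    """Pull each video embedding toward the text embedding of its label.

    video_emb:      (B, D) video features
    class_text_emb: (C, D) one text embedding per action label
    """
    logits = l2_normalize(video_emb) @ l2_normalize(class_text_emb).T
    return cross_entropy(logits / temperature, labels)


def multi_task_loss(losses, weights):
    """Weighted sum of named task losses (the weights are hypothetical)."""
    return sum(weights[name] * value for name, value in losses.items())
```

In use, each head would contribute one entry to `losses` (for example `"contrastive"` and `"visual_cls"`), and the weights would be tuned as hyperparameters.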

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition
