M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, Yong Liu
2024-01-22
Abstract: Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to adeptly satisfy the need for strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two competing requirements in video action recognition, that is, **the trade-off between strong supervised performance and strong generalization ability**. Specifically:

1. **Strong supervised performance**: Existing methods often sacrifice generalization in pursuit of strong supervised performance. In particular, when large-scale vision-language pretrained models such as CLIP are applied to video action recognition, fine-tuning the entire network can improve task-specific performance, but it is computationally expensive and may degrade CLIP's original generalization ability.
2. **Strong generalization ability**: To preserve CLIP's generalization ability, some studies adopt parameter-efficient fine-tuning (PEFT): the original CLIP parameters are frozen and adapters are introduced, which reduces the number of learnable parameters. However, in these methods, freezing the backbone within a multimodal framework lowers supervised performance, and they can also perform poorly in zero-shot scenarios.

To address these problems, the paper proposes a new multimodal, multi-task adaptation framework, **M2-CLIP**, which aims to achieve **strong supervised performance and strong generalization ability** simultaneously. The main contributions of M2-CLIP are:

1. **Multimodal adapter design**:
   - **TED-Adapter**: For the visual encoder, a novel Temporal Enhancement and Difference-modeling Adapter (TED-Adapter) performs global temporal enhancement and local temporal difference modeling at the same time, which strengthens the encoder's temporal representation ability.
   - **Text adapter**: An adapter is added to the text encoder to strengthen the learning of the semantic information in action labels, making the label representations more discriminative and adaptable.
2. **Multi-task decoder**: A decoder with four components improves the joint representation ability of the whole multimodal framework by imposing several task constraints:
   - **Contrastive learning head**: Aligns video and text representations so that they remain semantically consistent.
   - **Cross-modal classification head**: Emphasizes the discriminative power of cross-modal features, helping the model separate videos of different categories.
   - **Cross-modal masked language modeling head**: Encourages the visual features to focus on verbs, which improves action recognition accuracy.
   - **Visual classification head**: Sharpens the separation of video features across categories, further improving supervised performance.

With these designs, M2-CLIP performs well in supervised learning while retaining strong generalization in zero-shot scenarios. The reported experiments support the method's effectiveness on multiple datasets.

### Summary

By introducing multimodal adapters and a multi-task decoder, M2-CLIP reconciles the tension between strong supervised performance and strong generalization ability in video action recognition and strikes a balance between the two.
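
To make the TED-Adapter idea above more concrete, here is a minimal NumPy sketch of a residual bottleneck adapter that combines a global temporal term with a local frame-to-frame difference. It is an illustration only and not the authors' implementation: the function names, the use of a temporal mean as the "global enhancement" term, the ReLU nonlinearity, the way the two terms are summed, and the tensor layout `(T, N, D)` are all assumptions made for this example.

```python
import numpy as np


def temporal_difference(h):
    """Local temporal difference: h[t+1] - h[t], with zeros for the last frame.

    h has shape (T, N, d): T frames, N tokens per frame, d bottleneck channels.
    """
    diff = np.zeros_like(h)
    diff[:-1] = h[1:] - h[:-1]
    return diff


def ted_adapter_sketch(x, w_down, w_up):
    """Hypothetical TED-style adapter (an assumption, not the paper's exact design).

    x:      (T, N, D) frame tokens from a frozen visual encoder layer
    w_down: (D, d)    down-projection into a small bottleneck (d << D)
    w_up:   (d, D)    up-projection back to the encoder width
    """
    h = x @ w_down                                # (T, N, d) bottleneck features

    # Assumed stand-in for "global temporal enhancement":
    # add the mean over all frames to every frame.
    enhanced = h + h.mean(axis=0, keepdims=True)

    # Local temporal difference between adjacent frames.
    diff = temporal_difference(h)

    # Fuse both terms and apply a ReLU (assumed nonlinearity).
    fused = np.maximum(enhanced + diff, 0.0)

    # Residual connection: only w_down and w_up would be trained,
    # while the backbone that produced x stays frozen.
    return x + fused @ w_up


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16, 64))              # 8 frames, 16 tokens, width 64
    w_down = rng.normal(size=(64, 8)) * 0.02
    w_up = np.zeros((8, 64))                      # zero init keeps the adapter an identity map
    y = ted_adapter_sketch(x, w_down, w_up)
    print(y.shape, np.allclose(y, x))
```

Initializing `w_up` to zeros is a common adapter choice: the adapted layer starts out identical to the frozen backbone, so training begins from the pretrained model's behavior.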