Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

2023-04-07

Abstract: Adopting contrastive image-text pretrained models like CLIP for video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Conversely, freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that balances supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) global video-level prompts to model the data distribution; 2) local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize far fewer parameters and retain the existing general representation, which helps achieve the strong zero-shot performance. Our code and models are released at https://github.com/TalalWasim/Vita-CLIP.
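The three vision-side prompt types named in the abstract can be pictured as extra tokens placed alongside each frame's own tokens. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the token layout, the tiny dimensions, the mean-pooled summary token, and every function and constant name are hypothetical and chosen only for illustration. In a real model the prompts would be learned parameters fed to a frozen CLIP encoder; random vectors stand in for them here.

```python
import random

EMBED_DIM = 4      # hypothetical embedding width (tiny, for illustration)
NUM_FRAMES = 3     # hypothetical number of sampled frames
NUM_PATCHES = 2    # hypothetical number of patch tokens per frame
NUM_GLOBAL = 2     # hypothetical number of global video-level prompts


def random_vector(rng, dim=EMBED_DIM):
    """Stand-in for a learned or encoded token embedding."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_frame_sequences(cls_tokens, patch_tokens, global_prompts, local_prompts):
    """Assemble one token sequence per frame.

    Assumed layout per frame:
    [summary, local prompt of this frame, global prompts..., CLS, patches...]
    """
    # Assumption: the summary token condenses all frames. A plain mean of
    # the per-frame CLS tokens stands in for a learned summarizer.
    summary = mean_vector(cls_tokens)
    sequences = []
    for t in range(len(cls_tokens)):
        sequences.append(
            [summary, local_prompts[t], *global_prompts, cls_tokens[t], *patch_tokens[t]]
        )
    return sequences


rng = random.Random(0)
cls_tokens = [random_vector(rng) for _ in range(NUM_FRAMES)]
patch_tokens = [
    [random_vector(rng) for _ in range(NUM_PATCHES)] for _ in range(NUM_FRAMES)
]
global_prompts = [random_vector(rng) for _ in range(NUM_GLOBAL)]  # shared by all frames
local_prompts = [random_vector(rng) for _ in range(NUM_FRAMES)]   # one per frame

sequences = build_frame_sequences(cls_tokens, patch_tokens, global_prompts, local_prompts)
tokens_per_frame = len(sequences[0])
print(f"{len(sequences)} frames, {tokens_per_frame} tokens per frame")
```

In this sketch the global prompts and the summary token are identical across frames, while each frame gets its own local prompt. That mirrors the distinction the abstract draws between video-level and frame-level conditioning.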

Computer Vision and Pattern Recognition, Image and Video Processing

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the trade-off between supervised accuracy and zero-shot generalization when pre-trained image-text models (such as CLIP) are adapted to video classification. Fine-tuning the backbone improves supervised accuracy but weakens zero-shot generalization, while freezing it preserves generalization at the cost of supervised accuracy. As a result, existing methods typically train two separate models, one for supervised and one for zero-shot recognition, which is inefficient and does not optimize both settings at once. The paper therefore proposes Vita-CLIP, a multimodal prompt learning scheme that balances both capabilities within a single model:

1. **Maintaining Zero-Shot Generalization Ability**: Freezing the backbone of the pre-trained model preserves its original generalization ability.
2. **Improving Supervised Learning Performance**: A multimodal prompt mechanism lets the model adapt to video data and perform well in the supervised setting.

With separate prompting strategies on the vision and text sides, the method achieves state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
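Zero-shot recognition with a CLIP-style model works by comparing a video embedding against text embeddings of the candidate class names, so a new class can be added without retraining. The sketch below illustrates only that matching step. It is a simplified illustration, not code from the paper: the embeddings are made-up three-dimensional vectors, the `temperature` value is arbitrary, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(video_embedding, class_text_embeddings, temperature=0.01):
    """Return the best-matching class name and the per-class probabilities."""
    names = list(class_text_embeddings)
    logits = [
        cosine_similarity(video_embedding, class_text_embeddings[name]) / temperature
        for name in names
    ]
    probabilities = softmax(logits)
    best = max(range(len(names)), key=probabilities.__getitem__)
    return names[best], dict(zip(names, probabilities))


# Made-up embeddings; a real system would obtain them from the encoders.
class_text_embeddings = {
    "archery": [1.0, 0.0, 0.0],
    "bowling": [0.0, 1.0, 0.0],
    "surfing": [0.0, 0.0, 1.0],
}
video_embedding = [0.9, 0.1, 0.0]

label, probs = classify(video_embedding, class_text_embeddings)
print(label)
```

Supporting an unseen class in this setup only requires adding another entry to `class_text_embeddings`, which is why keeping the pre-trained embedding space intact matters for zero-shot performance.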

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Fine-tuned CLIP Models are Efficient Video Learners

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

Semantic Residual Prompts for Continual Learning

Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model Via Interpolated Weight Optimization

VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval

CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

Zoom-shot: Fast and Efficient Unsupervised Zero-Shot Transfer of CLIP to Vision Encoders with Multimodal Loss

In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

Learning to Prompt for Vision-Language Models

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting

CLIPArTT: Adaptation of CLIP to New Domains at Test Time

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition