Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah
2023-04-07
Abstract: Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) Global video-level prompts to model the data distribution; 2) Local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation, which helps achieve the strong zero-shot performance. Our code and models are released at https://github.com/TalalWasim/Vita-CLIP.
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses the trade-off between supervised accuracy and zero-shot generalization when pre-trained image-text models such as CLIP are adapted to video classification. Fine-tuning the backbone yields strong supervised performance but weakens zero-shot generalization, while freezing the backbone preserves zero-shot capability at the cost of supervised accuracy. As a result, existing methods typically train two separate models, one for supervised recognition and one for zero-shot recognition. This is inefficient and does not optimize both kinds of performance at once. The paper therefore proposes a multimodal prompt learning scheme, Vita-CLIP, which balances the two capabilities within a single model:

1. **Maintaining Zero-Shot Generalization Ability**: The backbone of the pre-trained model is kept frozen, so its original general representation is preserved.
2. **Improving Supervised Learning Performance**: A multimodal prompting mechanism lets the model adapt to video data and perform well in the supervised setting.

By introducing different prompting strategies on the vision and text sides, the method achieves state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
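To make the vision-side prompting more concrete, the following is a minimal sketch of how the three prompt types named in the abstract (global video-level prompts, local frame-level prompts, and a summary prompt) might be combined with the tokens of each frame before they enter a frozen encoder. It is not the authors' implementation. The function name `build_frame_sequences`, the token ordering, the use of a plain mean over per-frame class tokens as the summary, and the additive conditioning of local prompts are all assumptions made for illustration; the released code should be consulted for the actual design.

```python
import random


def make_vector(dim, rng):
    """Return a small random vector standing in for a token embedding."""
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]


def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def build_frame_sequences(cls_tokens, patch_tokens, global_prompts, local_prompts):
    """Assemble one token sequence per frame (hypothetical layout).

    cls_tokens:     one class token per frame
    patch_tokens:   one list of patch tokens per frame
    global_prompts: learnable vectors shared by every frame of every video
    local_prompts:  one learnable vector per frame position
    """
    # Assumption: the summary prompt is a plain mean of the per-frame class
    # tokens. The paper only states that it condenses the video.
    summary = mean_vectors(cls_tokens)

    sequences = []
    for t, (cls, patches) in enumerate(zip(cls_tokens, patch_tokens)):
        # Assumption: the local prompt is conditioned on its frame by adding
        # that frame's class token to the learnable vector.
        local = add_vectors(local_prompts[t], cls)
        sequences.append([cls, summary, *global_prompts, local, *patches])
    return sequences


rng = random.Random(0)
DIM, FRAMES, PATCHES, N_GLOBAL = 8, 4, 16, 2

cls_tokens = [make_vector(DIM, rng) for _ in range(FRAMES)]
patch_tokens = [[make_vector(DIM, rng) for _ in range(PATCHES)] for _ in range(FRAMES)]

# Only these prompt vectors would be trained; the backbone stays frozen.
global_prompts = [make_vector(DIM, rng) for _ in range(N_GLOBAL)]
local_prompts = [make_vector(DIM, rng) for _ in range(FRAMES)]

sequences = build_frame_sequences(cls_tokens, patch_tokens, global_prompts, local_prompts)
print(len(sequences), len(sequences[0]))  # 4 frames, 21 tokens each
```

In this sketch the only trainable quantities are `global_prompts` and `local_prompts`, which mirrors the abstract's point that keeping the backbone frozen leaves a much smaller number of parameters to optimize. The text-side prompting mentioned in the abstract is not modeled here.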