Abstract: Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability across a range of downstream tasks. However, they are sensitive to variations in the input text prompts and require careful selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to learn the prompts dynamically as textual inputs, avoiding laborious hand-crafted prompt engineering during fine-tuning. We observe that these methods are suboptimal in two respects. First, the prompts of the vision and language branches are usually either separate or only uni-directionally correlated. As a result, the prompts of the two branches are not fully correlated and may not provide enough guidance to align their representations. Second, most previous methods achieve better performance on seen classes but degrade performance on unseen classes compared to CLIP, because the essential generic knowledge learned during pretraining is partly forgotten during fine-tuning. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to address these limitations. Specifically, our method generates prompts by considering the prompts from both branches, which enhances the alignment of their representations. In addition, to alleviate forgetting of essential knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of the hand-crafted prompts from pre-trained CLIP in the late transformer layers. We evaluate our method on three representative tasks: generalization to novel classes, to new target datasets, and to unseen domain shifts. Experimental results demonstrate the superiority of our method, which yields a favorable performance boost on all tasks with high efficiency.
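
As a rough illustration of the knowledge-preservation idea described in the abstract, the sketch below computes a discrepancy between learned-prompt features and frozen hand-crafted-prompt features over the last few transformer layers. The abstract does not specify the distance measure or the number of layers, so the cosine distance, the function names, and the `num_late_layers` parameter are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def discrepancy_loss(learned_feats, handcrafted_feats, num_late_layers):
    """Mean (1 - cosine similarity) over the last `num_late_layers` layers.

    learned_feats:     one feature vector per layer, from the learned prompts.
    handcrafted_feats: one feature vector per layer, from hand-crafted prompts
                       passed through the frozen pre-trained model.
    Both the cosine distance and the layer-wise averaging are assumptions
    of this sketch; the abstract only says the discrepancy is minimized
    in the late layers.
    """
    if num_late_layers < 1:
        raise ValueError("num_late_layers must be at least 1")
    learned_late = learned_feats[-num_late_layers:]
    frozen_late = handcrafted_feats[-num_late_layers:]
    terms = [1.0 - cosine_similarity(u, v)
             for u, v in zip(learned_late, frozen_late)]
    return sum(terms) / len(terms)


# Toy two-layer example: features differ in the early layer only.
learned = [[1.0, 0.0], [1.0, 0.0]]
frozen = [[0.0, 1.0], [1.0, 0.0]]
late_only = discrepancy_loss(learned, frozen, num_late_layers=1)
all_layers = discrepancy_loss(learned, frozen, num_late_layers=2)
```

In this toy case, restricting the penalty to the late layer ignores the early-layer mismatch, so the early layers remain free to adapt while the late representations are kept close to the frozen ones. In training, a term like this would be added to the task loss with some weighting, which is likewise not stated in the abstract.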

PromptLearner-CLIP: Contrastive Multi-Modal Action Representation Learning with Context Optimization

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

ActionCLIP: A New Paradigm for Video Action Recognition

Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

MaPLe: Multi-modal Prompt Learning

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

Multi-modal Prompting for Low-Shot Temporal Action Localization

Deeply Coupled Cross-Modal Prompt Learning

ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization

COMMA: Co-Articulated Multi-Modal Learning

CLIP-SP: Vision-language Model with Adaptive Prompting for Scene Parsing

Multi-modal Attribute Prompting for Vision-Language Models

Self-supervised pretext task collaborative multi-view contrastive learning for video action recognition

Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition

Cross-modal learning with multi-modal model for video action recognition based on adaptive weight training

Spatio-Temporal Context Prompting for Zero-Shot Action Detection

Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning

ActivityCLIP: Enhancing Group Activity Recognition by Mining Complementary Information from Text to Supplement Image Modality

Fine-grained Knowledge Graph-driven Video-Language Learning for Action Recognition

GBC: Guided Alignment and Adaptive Boosting CLIP Bridging Vision and Language for Robust Action Recognition