Abstract: Multi-modal prompt learning is a high-performance and cost-effective learning paradigm that learns text and image prompts to tune pre-trained vision-language (V-L) models such as CLIP for multiple downstream tasks. However, recent methods typically treat text and image prompts as independent components, without considering the dependency between them. Moreover, extending multi-modal prompt learning to the medical field is challenging because of the significant gap between general- and medical-domain data. To this end, we propose a Multi-modal Collaborative Prompt Learning (MCPL) pipeline that tunes a frozen V-L model to align medical text-image representations for medical downstream tasks. We first construct the anatomy-pathology (AP) prompt, which is used for multi-modal prompting jointly with the text and image prompts. The AP prompt introduces instance-level anatomy and pathology information, helping a V-L model better comprehend medical reports and images. Next, we propose a graph-guided prompt collaboration module (GPCM), which explicitly establishes multi-way couplings between the AP, text, and image prompts, enabling collaborative generation and updating of multi-modal prompts for more effective prompting. Finally, we develop a novel prompt configuration scheme, which attaches the AP prompt to the query and key, and the text/image prompt to the value, in self-attention layers to improve the interpretability of multi-modal prompts. Extensive experiments on numerous medical classification and object detection datasets show that the proposed pipeline achieves excellent effectiveness and generalization. Compared with state-of-the-art prompt learning methods, MCPL provides a more reliable multi-modal prompt paradigm for reducing the tuning costs of V-L models on medical downstream tasks. Our code: https://github.com/CUHK-AIM-Group/MCPL.
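The prompt configuration scheme in the abstract (AP prompt on the query and key, text/image prompt on the value) can be illustrated with a small single-head self-attention sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function name `prompted_self_attention`, prepending prompts to the token sequence, the equal prompt lengths (needed so keys and values line up), and the random toy inputs. It is not the MCPL implementation; see the linked repository for that.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def prompted_self_attention(tokens, ap_prompt, modal_prompt, w_q, w_k, w_v):
    """Single-head self-attention with prompts split across Q/K and V.

    Assumption (not from the paper): prompts are prepended to the tokens,
    and both prompts have the same length so keys and values stay aligned.
    """
    if len(ap_prompt) != len(modal_prompt):
        raise ValueError("AP and text/image prompts must have equal length")
    qk_input = ap_prompt + tokens    # AP prompt feeds the query and key
    v_input = modal_prompt + tokens  # text/image prompt feeds the value
    q = matmul(qk_input, w_q)
    k = matmul(qk_input, w_k)
    v = matmul(v_input, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def rand_matrix(rows, cols):
    """Toy random matrix standing in for learned parameters or features."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim, n_tokens, n_prompt = 4, 3, 2
tokens = rand_matrix(n_tokens, dim)
ap_prompt = rand_matrix(n_prompt, dim)     # hypothetical AP prompt vectors
text_prompt = rand_matrix(n_prompt, dim)   # hypothetical text prompt vectors
w_q, w_k, w_v = (rand_matrix(dim, dim) for _ in range(3))

out, weights = prompted_self_attention(tokens, ap_prompt, text_prompt, w_q, w_k, w_v)
print(len(out), len(out[0]))  # 5 rows (2 prompt + 3 token positions), 4 columns
```

Under these assumptions, the attention weights depend only on the AP prompt and the tokens, while the text/image prompt only affects what is mixed into the output. That separation is one plausible reading of why the scheme is described as improving interpretability.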

CoPL: Parameter-Efficient Collaborative Prompt Learning for Audio-Visual Tasks

Conditional Prompt Tuning for Multimodal Fusion

Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Parameter Efficient Multi-task Fine-tuning by Learning to Transfer Token-wise Prompts

Dynamic Visual Prompt Tuning for Parameter Efficient Transfer Learning

Making Pre-trained Language Models End-to-end Few-shot Learners with Contrastive Prompt Tuning

VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval

MCPL: Multi-modal Collaborative Prompt Learning for Medical Vision-Language Model

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Learning to Prompt for Vision-Language Models

Pro-tuning: Unified Prompt Tuning for Vision Tasks

Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

DPL: Decoupled Prompt Learning for Vision-Language Models

FPT: Improving Prompt Tuning Efficiency Via Progressive Training

Multi-Prompt with Depth Partitioned Cross-Modal Learning

Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning

CoPL: Contextual Prompt Learning for Vision-Language Understanding

MaPLe: Multi-modal Prompt Learning

Visual Prompt Multi-Modal Tracking