Abstract: Prompt tuning, which conditions a model on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches usually learn the prompt vectors for each task independently from scratch, thereby failing to exploit the rich shareable knowledge across different vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show that many target tasks can benefit each other by sharing prompt vectors and thus can be jointly learned via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and unified vision-language prompt tuning. Results on 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting a new state of the art on the few-shot ELEVATER benchmarks and cross-task generalization benchmarks. To understand where the cross-task knowledge is most effective, we also conduct a large-scale study on task transferability with 20 vision tasks in 400 combinations for each prompt tuning method. The study shows that the most performant MVLPT for each prompt tuning method prefers different task combinations and that many tasks can benefit each other, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.
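The two-stage idea in point (i), learning one shared prompt on several source tasks and then using it to initialize the prompt for a target task, can be illustrated with a toy sketch. This is not the MVLPT implementation: the frozen vision-language model is replaced by a hypothetical quadratic loss over the prompt vector, the tasks are synthetic, and every name below (`task_loss_and_grad`, `tune_prompt`, `make_task`) is an assumption made for illustration only.

```python
import numpy as np

def task_loss_and_grad(prompt, A, b):
    """Toy stand-in for a frozen model: the loss depends only on the prompt vector."""
    residual = A @ prompt - b
    loss = 0.5 * float(residual @ residual) / len(b)
    grad = A.T @ residual / len(b)
    return loss, grad

def tune_prompt(prompt, tasks, steps, lr):
    """Gradient descent on the prompt, averaging gradients over the given tasks."""
    prompt = prompt.copy()
    for _ in range(steps):
        grads = [task_loss_and_grad(prompt, A, b)[1] for A, b in tasks]
        prompt -= lr * np.mean(grads, axis=0)
    return prompt

def make_task(rng, shared, dim, n=32, noise=0.1):
    """Synthetic task whose optimal prompt lies near a shared vector (an assumption)."""
    A = rng.normal(size=(n, dim))
    optimum = shared + noise * rng.normal(size=dim)
    return A, A @ optimum

rng = np.random.default_rng(0)
dim = 8
shared = rng.normal(size=dim)
source_tasks = [make_task(rng, shared, dim) for _ in range(4)]
target_task = make_task(rng, shared, dim)

# Stage 1: learn a single shared prompt jointly on all source tasks.
init = tune_prompt(np.zeros(dim), source_tasks, steps=200, lr=0.1)

# Stage 2: adapt to the target task with a small step budget,
# once from a zero prompt and once from the shared initialization.
budget = 5
from_scratch = tune_prompt(np.zeros(dim), [target_task], steps=budget, lr=0.1)
from_shared = tune_prompt(init, [target_task], steps=budget, lr=0.1)

loss_scratch = task_loss_and_grad(from_scratch, *target_task)[0]
loss_shared = task_loss_and_grad(from_shared, *target_task)[0]
print(f"target loss from scratch:     {loss_scratch:.4f}")
print(f"target loss from shared init: {loss_shared:.4f}")
```

In this toy setting the shared initialization reaches a lower target loss within the same step budget only because the synthetic tasks were built to have nearby optima. Whether real tasks share such structure is the question the transferability study in the abstract addresses.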

CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Neural Collapse Anchored Prompt Tuning for Generalizable Vision-Language Models

Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Efficient Test-Time Prompt Tuning for Vision-Language Models

CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task

Visual Prompt Tuning

MePT: Multi-Representation Guided Prompt Tuning for Vision-Language Model

LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Multitask Vision-Language Prompt Tuning

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

Prompt Tuning with Soft Context Sharing for Vision-Language Models

Progressive Multi-modal Conditional Prompt Tuning

Unified Vision and Language Prompt Learning

Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model

Improving Visual Prompt Tuning for Self-supervised Vision Transformers

Revisiting Prompt Pretraining of Vision-Language Models

VPPT: Visual Pre-Trained Prompt Tuning Framework for Few-Shot Image Classification

Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?