Abstract: In recent years, the rapid growth in model size has produced large-scale pre-trained models with remarkable capabilities, and the trend towards ever-larger models continues. This trend introduces significant challenges, including the substantial computational cost of training and of transfer to downstream tasks. Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to address these issues. They adapt large-scale pre-trained models to specific tasks by fine-tuning only a small subset of parameters. Among PEFT methods, adapter-based and prompt-based methods are the primary techniques. In visual fine-tuning specifically, adapters have gained prominence over prompts because prompts deliver relatively weaker performance and efficiency. Motivated by this, we refine the widely used Visual Prompt Tuning (VPT) method and propose Cross Visual Prompt Tuning (CVPT). CVPT calculates cross-attention between the prompt tokens and the embedded tokens, which captures the semantic relationship between them and allows the model to be fine-tuned more precisely for visual tasks. Furthermore, we introduce a weight-sharing mechanism to initialize the parameters of cross-attention, which avoids the large number of learnable parameters that cross-attention would otherwise add and enhances its representational capability. We conduct comprehensive testing across 25 datasets, and the results indicate that CVPT significantly improves on VPT's performance and efficiency in visual tasks. For example, on the VTAB-1K benchmark, CVPT outperforms VPT by over 4% in average accuracy, rivaling advanced adapter-based methods in both performance and efficiency. Our experiments confirm that prompt-based methods can achieve exceptional results in visual fine-tuning.
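The cross-attention and weight-sharing ideas described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify which tokens act as queries, how the outputs are combined, or the layer layout, so the choices below (embedded tokens as queries, prompt tokens as keys and values, single-head attention, a residual sum, and all names and dimensions) are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with linear projections."""
    q, k, v = queries @ w_q, keys @ w_k, values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


rng = np.random.default_rng(0)
dim, num_patches, num_prompts = 8, 5, 3  # hypothetical sizes

# Stand-ins for the frozen, pre-trained self-attention projections.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

embedded = rng.normal(size=(num_patches, dim))  # embedded image tokens
prompts = rng.normal(size=(num_prompts, dim))   # learnable prompt tokens

# Weight sharing (assumed form): the cross-attention projections start as
# copies of the self-attention projections instead of fresh random weights.
cross_w_q, cross_w_k, cross_w_v = w_q.copy(), w_k.copy(), w_v.copy()

# Self-attention over the embedded tokens only; the prompts are kept out of
# the sequence, so its length stays at num_patches.
self_out = attention(embedded, embedded, embedded, w_q, w_k, w_v)

# Cross-attention: embedded tokens query the prompt tokens (assumption).
cross_out = attention(embedded, prompts, prompts,
                      cross_w_q, cross_w_k, cross_w_v)

# Residual combination of both branches (assumed, for illustration).
fused = embedded + self_out + cross_out
print(fused.shape)
```

In this sketch the only task-specific trainable tensor is `prompts`; because the cross-attention projections are copied from existing weights, no new projection matrices have to be learned from scratch, which is the parameter saving the abstract attributes to weight sharing.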
