Abstract: Prompt tuning (PT) has long been recognized as an effective and efficient paradigm for transferring large pre-trained vision-language models (VLMs) to downstream tasks by learning a tiny set of context vectors. Nevertheless, in this work, we reveal that freezing the parameters of VLMs during learning the context vectors neither facilitates the transferability of pre-trained knowledge nor improves the memory and time efficiency significantly. Upon further investigation, we find that reducing both the length and width of the feature-gradient propagation flows of the full fine-tuning (FT) baseline is key to achieving effective and efficient knowledge transfer. Motivated by this, we propose Skip Tuning, a novel paradigm for adapting VLMs to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules. Extensive experiments across a wide spectrum of benchmarks demonstrate the superior effectiveness and efficiency of our Skip Tuning over both PT and adapter-based methods. Code: https://github.com/Koorye/SkipTuning

What problem does this paper attempt to address?

The paper addresses a weakness of existing Prompt Tuning (PT) methods for transferring large pre-trained Vision-Language Models (VLMs) to downstream tasks: they improve parameter efficiency, but they do not significantly improve memory and time efficiency, and they may even degrade performance. Specifically:

1. **Trade-off between parameter efficiency and performance**: PT methods improve parameter efficiency by freezing most of the VLM's weights and learning only a small number of context vectors. This does not significantly improve memory and time efficiency, and it sometimes reduces classification accuracy.
2. **Limitations of existing methods**: Differences in implementation details among PT methods mask the actual performance improvement, especially in terms of memory and time efficiency. In addition, existing methods fail to fully optimize the memory and time efficiency of the Full Fine-tuning (FT) baseline.

To solve these problems, the authors propose a new paradigm, **Skip Tuning**, which aims to achieve more efficient knowledge transfer by reducing the length and width of Feature-Gradient Propagation Flows (FGPFs) without introducing additional context vectors or adapter modules.

### Specific problem description

1. **Improvement of memory and time efficiency**: The authors observe that reducing the length and width of FGPFs significantly improves memory and time efficiency without sacrificing transfer-learning performance.
2. **Balance between effectiveness and efficiency**: Skip Tuning combines Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) to reduce the length and width of FGPFs simultaneously, thereby achieving effective knowledge transfer.

### Solution

The core idea of Skip Tuning is to optimize the memory and time efficiency of the FT baseline through two strategies:

- **Layer-wise Skipping (LSkip)**: The intermediate features of the first ω layers of the CLIP visual encoder and text encoder are cached, so these shallow layers are skipped during fine-tuning. This reduces the length of FGPFs.
- **Class-wise Skipping (CSkip)**: Unimportant class tokens are filtered out for each training image. This reduces the width of FGPFs and thereby improves memory and time efficiency.

With these strategies, Skip Tuning improves memory and time efficiency and also outperforms existing PT and adapter-based methods on multiple benchmarks.

### Main contributions

1. **Reveal that reducing the length and width of FGPFs is the key to achieving effective and efficient knowledge transfer**.
2. **Propose the Skip Tuning method, which achieves efficient VLM transfer without relying on additional context vectors or adapter modules**.
3. **Verify the superiority of Skip Tuning on a wide range of benchmarks, demonstrating its effectiveness and efficiency across multiple tasks**.

In summary, this paper targets the memory and time inefficiency of existing PT methods and proposes a new, more efficient transfer learning method, Skip Tuning.
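The two skipping strategies can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The names `LayerSkipEncoder`, `class_skip`, and `keep` are hypothetical, the "layers" are toy functions rather than CLIP blocks, and the class-selection rule (always keep the ground-truth class, then the classes most similar to the image feature) is an assumption made for illustration, since the summary only says that unimportant class tokens are filtered out.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_layers(layers: Sequence[Layer], x: Vector) -> Vector:
    """Apply the given layers to x in order."""
    for layer in layers:
        x = layer(x)
    return x


class LayerSkipEncoder:
    """Toy encoder that caches the output of its first `omega` layers.

    Assumption: the first `omega` layers are frozen and each input is
    identified by a stable key, so the cached features stay valid.
    """

    def __init__(self, layers: Sequence[Layer], omega: int) -> None:
        self.shallow = list(layers[:omega])  # run once per input, then skipped
        self.deep = list(layers[omega:])     # the layers fine-tuning would update
        self.cache: Dict[str, Vector] = {}
        self.shallow_calls = 0               # counts real passes through the shallow layers

    def forward(self, key: str, x: Vector) -> Vector:
        if key not in self.cache:
            self.cache[key] = run_layers(self.shallow, x)
            self.shallow_calls += 1
        # Only the deep layers run on every call, which shortens the flow.
        return run_layers(self.deep, self.cache[key])


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product used as a similarity score."""
    return sum(p * q for p, q in zip(a, b))


def class_skip(image_feat: Vector, class_feats: Sequence[Vector],
               label: int, keep: int) -> List[int]:
    """Return the indices of `keep` classes to retain for one image.

    Assumed rule: keep the ground-truth class plus the most similar others.
    """
    scores = [dot(image_feat, c) for c in class_feats]
    others = sorted((i for i in range(len(class_feats)) if i != label),
                    key=lambda i: scores[i], reverse=True)
    return sorted([label] + others[:keep - 1])


# Toy usage: three "layers", the first two are cached and skipped.
toy_layers: List[Layer] = [
    lambda v: [2 * t for t in v],
    lambda v: [t + 1 for t in v],
    lambda v: [t * t for t in v],
]
encoder = LayerSkipEncoder(toy_layers, omega=2)
first = encoder.forward("img0", [1.0, 2.0])
second = encoder.forward("img0", [1.0, 2.0])  # reuses the cached shallow features

kept = class_skip([1.0, 0.0],
                  [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.0, 1.0]],
                  label=1, keep=2)
```

In this toy run the shallow layers execute once even though `forward` is called twice, and `kept` contains the ground-truth class plus its single most similar competitor. In a real setup the loss would then be computed over the kept classes only, which is what narrows the flow.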

Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves
