Abstract: Since the release of various large-scale natural language processing (NLP) pre-trained models, parameter-efficient transfer learning (PETL) has become a popular paradigm capable of achieving impressive performance on various downstream tasks. PETL aims to make good use of the representation knowledge in large pre-trained models by fine-tuning only a small number of parameters. Recently, developing PETL techniques for vision tasks has also attracted increasing attention. Popular PETL techniques such as Prompt-tuning and Adapter have been proposed for high-level visual downstream tasks such as image classification and video recognition. However, Prefix-tuning remains under-explored for vision tasks. In this work, we intend to adapt large video-based models to downstream tasks with a good parameter-accuracy trade-off. Towards this goal, we propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off. Specifically, we analyze the positional importance of trainable parameters and the differences between NLP and vision tasks in terms of data structures and pre-training mechanisms while implementing various PETL techniques, especially the under-explored prefix-tuning technique. Based on a comprehensive understanding of the differences between NLP and video data, we propose a new variation of the prefix-tuning module called parallel attention (PATT) for video-based downstream tasks. An extensive empirical analysis on two video datasets with different frozen backbones has been carried out, and the findings show that the proposed PATT can effectively contribute to other PETL techniques. An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework, achieves significantly better performance than the state-of-the-art AdaptFormer-Swin with slightly more parameters and outperforms full-tuning with far fewer parameters.

Parameter-Efficient Transfer Learning for Audio-Visual-Language Tasks

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Efficient Transfer Learning for Video-language Foundation Models

When Vision Models Meet Parameter Efficient Look-Aside Adapters Without Large-Scale Audio Pretraining

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Cross-Modal Adapter for Text-Video Retrieval

VMT-Adapter: Parameter-Efficient Transfer Learning for Multi-Task Dense Scene Understanding

MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval

MPT4LM: Multi-Modal Prompt Tuning Makes Pre-Trained Large Language Models Better Vision-Language Learners

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling

Token Mixing: Parameter-Efficient Transfer Learning from Image-Language to Video-Language

A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter

Towards a Unified View on Visual Parameter-Efficient Transfer Learning

p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models

CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

One for All: Video Conversation is Feasible Without Video Instruction Tuning