VPPT: Visual Pre-Trained Prompt Tuning Framework for Few-Shot Image Classification

Zhao Song, Ke Yang, Naiyang Guan, Junjie Zhu, Peng Qiao, Qingyong Hu
DOI: https://doi.org/10.1109/icassp49357.2023.10095154
2023-01-01
Abstract: Large-scale pre-trained transformers have recently achieved remarkable success in several computer vision tasks. However, fully fine-tuning these models for downstream tasks remains highly challenging because of the expensive computational and storage cost. Recently, Parameter-Efficient Tuning (PETuning) techniques, e.g., Visual Prompt Tuning (VPT), have significantly reduced the computation cost by inserting lightweight prompt modules, such as prompt tokens or adapter layers, into the pre-trained models and tuning only these prompt modules, which contain a small number of trainable parameters, while keeping the transformer backbone frozen. Although encouraging results have been achieved, existing PETuning methods do not perform well in few-shot learning settings (i.e., extremely limited training data, with only 1 or 2 shots per class) because the supervision signal is scarce. To this end, we first empirically identify that the poor performance is mainly due to the inappropriate initialization of the prompt modules, an observation that has also been verified for pre-trained language models. Next, we propose a Visual Pre-trained Prompt Tuning framework (VPPT), which pre-trains the prompt modules first and then uses the pre-trained modules together with the pre-trained transformer backbone to perform prompt tuning on downstream tasks. Extensive experiments show that, in few-shot image classification, our VPPT framework achieves an absolute improvement of 16.08% in average accuracy in the 1-shot setting on five fine-grained visual classification datasets, compared with previous PETuning techniques such as VPT.
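To give a rough sense of why tuning only prompt tokens is lightweight compared with full fine-tuning, the following is a minimal back-of-the-envelope sketch. All sizes are assumptions chosen for illustration (a ViT-B/16-like backbone with hidden size 768, 12 layers, and roughly 86M parameters; 10 prompt tokens; 200 classes) and are not taken from the paper. The function names and the `per_layer` flag are hypothetical.

```python
# Illustrative parameter arithmetic for prompt tuning with a frozen backbone.
# All constants below are assumptions for illustration, not values from the paper.
HIDDEN_DIM = 768               # assumed token embedding size (ViT-B-like)
NUM_LAYERS = 12                # assumed number of transformer layers
BACKBONE_PARAMS = 86_000_000   # approximate size of the frozen backbone


def prompt_params(num_prompts, hidden_dim=HIDDEN_DIM,
                  num_layers=NUM_LAYERS, per_layer=False):
    """Count trainable prompt-token parameters.

    per_layer=False: prompts are inserted at the input layer only.
    per_layer=True:  a separate set of prompts is inserted at every layer.
    """
    layers = num_layers if per_layer else 1
    return num_prompts * hidden_dim * layers


def head_params(num_classes, hidden_dim=HIDDEN_DIM):
    """Count parameters of a linear classification head (weights + biases)."""
    return hidden_dim * num_classes + num_classes


def trainable_fraction(num_prompts, num_classes, per_layer=False):
    """Fraction of all parameters that are updated during prompt tuning."""
    tuned = prompt_params(num_prompts, per_layer=per_layer) + head_params(num_classes)
    return tuned / (BACKBONE_PARAMS + tuned)


if __name__ == "__main__":
    for per_layer in (False, True):
        frac = trainable_fraction(num_prompts=10, num_classes=200, per_layer=per_layer)
        print(f"per_layer={per_layer}: {100 * frac:.3f}% of parameters are trainable")
```

Under these assumed sizes, well under 1% of the parameters are trainable. With only 1 or 2 labelled images per class, the starting values of those few parameters matter a great deal, which is consistent with the abstract's emphasis on how the prompt modules are initialized.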