Prompt Generation Networks for Input-Space Adaptation of Frozen Vision Transformers

Jochem Loedeman, Maarten C. Stol, Tengda Han, Yuki M. Asano
2024-08-31
Abstract: With the introduction of the transformer architecture in computer vision, increasing model scale has been demonstrated as a clear path to achieving performance and robustness gains. However, with model parameter counts reaching the billions, classical finetuning approaches are becoming increasingly limiting and even unfeasible when models become hosted as inference APIs, as in NLP. Visual input-prompt learning, an adaptation technique in which additional inputs in visual (RGB) space are learned, has emerged as a potential solution for adapting frozen and cloud-hosted models, requiring neither access to the forward pass, nor post-processing. Yet so far, these constraints have deteriorated adaptation performances significantly. To this end, we propose the Prompt Generation Network (PGN) that generates a different prompt for every data point, which is then used to adapt a frozen pretrained vision model to a target task. We show that the PGN effectively adapts pretrained models to various new datasets: It surpasses previous methods by a large margin on 12/12 datasets and even outperforms full-finetuning on 5/12, while requiring 100x fewer parameters. Lastly, we introduce the "prompt inversion" trick, with which PGNs can be efficiently trained in a latent space but deployed in RGB input space for inference.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the difficulty of adapting large pre-trained vision models to new tasks or datasets. As the parameter counts of models such as Vision Transformers reach the billions, conventional fine-tuning becomes increasingly impractical, especially when the model is served through an API. Such models are typically hosted in the cloud and do not expose their weights or internal forward pass for modification.

To address this, the paper proposes **Prompt Generation Networks (PGN)**, which generate a separate prompt for each input image. The approach adapts the frozen pre-trained model to new tasks or datasets while training far fewer parameters than full fine-tuning (the abstract reports 100x fewer). PGN builds each image's prompt by combining feature vectors selected from a jointly learned library.

### Main contributions

1. **A simple and effective framework**: PGN learns visual prompts that depend on the input image.
2. **A new inference mode**: The PGN is decoupled from the large model, so it can be deployed in a client-server architecture in which the large model stays hosted remotely.
3. **Generalization and state-of-the-art performance across datasets and architectures**: PGN surpasses previous methods on all 12 datasets evaluated and outperforms full fine-tuning on 5 of them.
4. **Quantitative and qualitative analysis**: The analysis shows a division of labor emerging between the frozen model and the PGN.

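
The idea of building an image-specific prompt by combining vectors from a jointly learned library can be illustrated with a minimal NumPy sketch. Everything named here is hypothetical: `TinyPGN`, the single linear projection standing in for the lightweight network, the softmax mixing rule, and all sizes are assumptions chosen for illustration, and the weights are random rather than trained. It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class TinyPGN:
    """Hypothetical stand-in for a prompt generation network.

    Assumptions (not taken from the paper): one linear layer plays the role
    of the lightweight network, and prompts are softmax-weighted mixtures
    of the library vectors.
    """

    def __init__(self, in_dim, library_size, embed_dim, num_prompts, seed=0):
        rng = np.random.default_rng(seed)
        self.num_prompts = num_prompts
        self.library_size = library_size
        # Maps a flattened image to one logit per (prompt, library item) pair.
        self.proj = rng.normal(scale=0.1, size=(in_dim, num_prompts * library_size))
        # The library of feature vectors; in training it would be learned
        # jointly with the projection while the large model stays frozen.
        self.library = rng.normal(size=(library_size, embed_dim))

    def mixing_weights(self, images):
        """Per-image weights over library items, shape (batch, prompts, library)."""
        batch = images.shape[0]
        logits = images.reshape(batch, -1) @ self.proj
        logits = logits.reshape(batch, self.num_prompts, self.library_size)
        return softmax(logits, axis=-1)

    def __call__(self, images):
        """Return prompts of shape (batch, num_prompts, embed_dim)."""
        return self.mixing_weights(images) @ self.library


# Two random 8x8 RGB "images" yield two different sets of prompts.
images = np.random.default_rng(1).random((2, 8, 8, 3))
pgn = TinyPGN(in_dim=8 * 8 * 3, library_size=8, embed_dim=16, num_prompts=4)
prompts = pgn(images)
print(prompts.shape)  # (2, 4, 16)
```

Because the mixing weights depend on the image, each input gets its own prompts, in contrast to a single prompt set shared across the whole dataset.
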
### Method overview

- **Prompt learning**: Conventional prompt learning adapts a pre-trained model by learning a fixed set of prompt vectors. Because these vectors are shared across the entire dataset, they cannot adjust to individual inputs, which limits flexibility.
- **Prompt Generation Networks (PGN)**: A lightweight neural network generates prompt vectors conditioned on the input image. Each prompt vector is a combination of feature vectors selected from a jointly learned library.
- **Prompt inversion**: For models whose first layer cannot be modified (for example, models accessed through an API), PGN uses the "prompt inversion" technique to convert prompt vectors into RGB patches and attaches them to the bottom of the input image, so the prompt is applied in input space.

### Experimental results

- **Multi-dataset experiments**: In experiments on 12 public image datasets, PGN surpasses previous methods on all 12 and outperforms full fine-tuning on 5.
- **Parameter efficiency**: PGN requires about 100x fewer parameters than full fine-tuning while achieving comparable overall performance.
- **Multi-dataset adaptation**: PGN can be trained on several datasets at once, automatically discovers domain-specific features, and adapts to mixed-domain or unordered data.

In conclusion, PGN offers an efficient and flexible way to adapt large frozen pre-trained vision models to new tasks and datasets.
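
The prompt inversion step described in the method overview (turning prompt vectors into RGB patches attached to the bottom of the image) can be sketched as below. This is an illustration under stated assumptions, not the paper's procedure: it assumes the frozen model's first layer is a linear patch embedding, uses a random matrix `W` and bias `b` as hypothetical stand-ins for it, and inverts that layer with a pseudo-inverse. The function names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 4                        # patch side length (illustrative)
PATCH_DIM = PATCH * PATCH * 3    # flattened RGB patch size
EMBED_DIM = PATCH_DIM            # square map so the embedding is invertible

# Hypothetical stand-in for the frozen model's linear patch embedding.
W = rng.normal(size=(PATCH_DIM, EMBED_DIM))
b = rng.normal(size=EMBED_DIM)


def embed_patch(patch):
    """Frozen first layer: flatten an RGB patch and project it linearly."""
    return patch.reshape(-1) @ W + b


def invert_prompts(prompts):
    """Find RGB patches whose embeddings reproduce the given prompt vectors."""
    flat = (prompts - b) @ np.linalg.pinv(W)
    return flat.reshape(-1, PATCH, PATCH, 3)


def append_prompt_strip(image, rgb_patches):
    """Attach the patches side by side as one strip below the image."""
    strip = np.concatenate(list(rgb_patches), axis=1)
    if strip.shape[1] != image.shape[1]:
        raise ValueError("prompt strip must span the full image width")
    return np.concatenate([image, strip], axis=0)


image = rng.random((16, 16, 3))
prompts = rng.normal(size=(4, EMBED_DIM))    # e.g. produced by a trained PGN
patches = invert_prompts(prompts)
prompted = append_prompt_strip(image, patches)

# Passing the inverted patches through the frozen embedding recovers the prompts.
recovered = np.stack([embed_patch(p) for p in patches])
print(prompted.shape)  # (20, 16, 3)
```

The inverted patches are not constrained to a valid pixel range here; a real deployment would have to handle that, which this sketch leaves out. The point is only that the client can send an ordinary, slightly taller image, so no access to the hosted model's internals is required.
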