MixPrompt: Enhancing Generalizability and Adversarial Robustness for Vision-Language Models Via Prompt Fusion

Hao Fan, Zhaoyang Ma, Yong Li, Rui Tian, Yunli Chen, Chenlong Gao
DOI: https://doi.org/10.1007/978-981-97-5606-3_28
2024-01-01
Abstract: Pretrained Vision-Language Models (VLMs) such as CLIP perform remarkably well across downstream tasks, but their image encoders are vulnerable to adversarial examples. A recently introduced lightweight approach, Adversarial Prompt Tuning (AdvPT), trains learnable prompts on adversarial examples, improving the adversarial robustness of VLMs solely by manipulating the textual inputs. However, the static prompts learned by AdvPT overfit the base classes observed during training, which compromises the model's generalizability. In this paper, we propose a conditional Adversarial Prompt Tuning method, which extends AdvPT by additionally learning a network that generates an input-specific prompt for each input. These dynamic prompts enhance the generalizability of VLMs on unseen classes. Furthermore, since VLMs are inherently powerful generalizers, we incorporate the manual prompts used by VLMs during the testing phase to further improve generalizability. Extensive experiments on 8 datasets demonstrate that our prompt-fusion-based method significantly outperforms AdvPT on unseen classes, enhancing the generalizability and adversarial robustness of VLMs simultaneously.
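To make the idea of prompt fusion concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the dynamic and manual prompts are combined, so the weighted average of logits, the `alpha` parameter, the one-layer `meta_net`, and all function names are assumptions introduced purely for illustration. For simplicity the input-conditioned shift is applied directly to text features, as a stand-in for conditioning prompt tokens before a text encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_logits(image_feat, text_feats, scale=100.0):
    """CLIP-style logits: scaled cosine similarity between image and class text features."""
    img = normalize(image_feat)
    return [scale * sum(a * b for a, b in zip(img, normalize(t))) for t in text_feats]


def meta_net(image_feat, weights):
    """Hypothetical one-layer linear network mapping an image feature to a prompt shift."""
    return [sum(w * x for w, x in zip(row, image_feat)) for row in weights]


def conditional_text_feats(static_feats, shift):
    """Assumed stand-in for re-encoding prompts conditioned on the input: add the shift."""
    return [[t + s for t, s in zip(feat, shift)] for feat in static_feats]


def fused_logits(image_feat, static_feats, manual_feats, weights, alpha=0.5):
    """Assumed fusion rule: weighted average of dynamic-prompt and manual-prompt logits."""
    shift = meta_net(image_feat, weights)
    dynamic = cosine_logits(image_feat, conditional_text_feats(static_feats, shift))
    manual = cosine_logits(image_feat, manual_feats)
    return [alpha * d + (1.0 - alpha) * m for d, m in zip(dynamic, manual)]


# Toy usage with 2-dimensional features and two classes.
image = [1.0, 0.0]
learned = [[1.0, 0.0], [0.0, 1.0]]   # features of learned (static) prompts
manual = [[0.0, 1.0], [1.0, 0.0]]    # features of hand-written prompts
zero_weights = [[0.0, 0.0], [0.0, 0.0]]  # meta-net that produces no shift
scores = fused_logits(image, learned, manual, zero_weights, alpha=0.5)
```

In this sketch, `alpha=1.0` relies entirely on the learned dynamic prompts and `alpha=0.0` falls back to the manual prompts; intermediate values trade off between the two.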