Transferable Multimodal Attack on Vision-Language Pre-training Models
Haodi Wang,Kai Dong,Zhilei Zhu,Haotong Qin,Aishan Liu,Xiaolin Fang,Jiakai Wang,Xianglong Liu
DOI: https://doi.org/10.1109/sp54263.2024.00102
2024-01-01
Abstract:Vision-Language Pre-training (VLP) models have achieved remarkable success in practice, while easily being misled by adversarial attack. Though harmful, adversarial attacks are valuable in revealing the blind-spots of VLP models and promoting their robustness. However, existing adversarial attacking studies pay insufficient attention to the key roles of different modality-correlated features, leading to unsatisfactory transferable attacking performance. To tackle this issue, we propose the Transferable MultiModal (TMM) attack framework, which tailors both the modality consistency and modality discrepancy features. To promote transferability, we propose the attention-directed feature perturbation to disturb the modality-consistency features in critical attention regions. In light of the commonly employed cross-attention can represent the consistent features among diverse models, it is more possible to mislead the similar model perception for activating stronger transferability. For improving attacking ability, we proposed the orthogonal-guided feature heterogenization to guide the adversarial perturbation to contain more modality-discrepancy features in the encoded embeddings. Since VLP models rely more on aligned features among different modalities during decision-making, increasing the modality-discrepant could confuse the learned representation for better attacking ability. Extensive experiments under diverse settings demonstrate that the proposed TMM outperforms the comparisons by large margins, i.e., 20.47% improvements in transferable attacking ability on average. Moreover, we highlight that our TMM also shows outstanding attacking performance on large models, such as MiniGPT-4, Otter, etc.