Abstract: Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on the multi-modal semantic update using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario demonstrate that our attacking strategy can effectively mislead models to generate targeted responses using only black-box attacks without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model developments.

What problem does this paper attempt to address?

The main problem this paper addresses is evaluating the robustness of vision-language models (VLMs) against transfer-based adversarial attacks. Specifically, it studies how to mislead these models into producing wrong or attacker-chosen target responses by generating effective adversarial examples without any knowledge of the target model's internals (i.e., black-box attacks). The paper points out that most existing transfer-based attack methods ignore the semantic correlations between the visual and textual modalities, which leads to sub-optimal adversarial examples. To this end, the authors propose a new framework, Chain of Attack (CoA), which improves the transferability and attack efficiency of adversarial examples by iteratively updating their multimodal semantics over multiple steps. The paper also proposes a unified attack success rate calculation method for automatic evasion evaluation, providing a reference for future model security research.

### Main contributions of the paper

1. **A new transfer-based targeted attack framework**: Chain of Attack (CoA), which uses an explicit step-by-step semantic update process to enhance the generation of adversarial examples, thereby improving attack quality and success rate.
2. **A unified attack success rate calculation strategy**: An automatic attack success rate (ASR) calculation method based on large language models (LLMs), providing a fair and intuitive evaluation criterion for text generation tasks.
3. **A security and robustness evaluation of multiple VLMs**: VLMs of different scales are evaluated using black-box attack methods, demonstrating the effectiveness of the proposed method and revealing the vulnerabilities of existing VLMs.
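The unified attack success rate listed among the contributions can be illustrated with a minimal sketch. The summary does not give the paper's judge prompt or scoring rule, so everything here is an assumption: `attack_success_rate` and `keyword_judge` are hypothetical names, and `keyword_judge` is a simple keyword-overlap placeholder standing in for a real LLM judge. The point is only that the ASR is the fraction of victim-model responses that a judge accepts as matching the attacker's target.

```python
from typing import Callable, Sequence


def keyword_judge(response: str, target: str) -> bool:
    """Placeholder judge (an assumption, not the paper's method).

    Returns True when every word of the target text appears in the response.
    In practice this callable would wrap a call to an LLM that decides
    whether the response conveys the target semantics.
    """
    response_words = set(response.lower().split())
    return all(word in response_words for word in target.lower().split())


def attack_success_rate(
    responses: Sequence[str],
    targets: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of (response, target) pairs the judge marks as successful."""
    if len(responses) != len(targets):
        raise ValueError("responses and targets must have the same length")
    if not responses:
        return 0.0
    hits = sum(1 for r, t in zip(responses, targets) if judge(r, t))
    return hits / len(responses)


# Toy data: victim-model outputs on adversarial images, and attacker targets.
responses = [
    "a dog running on the beach",
    "a plate of food on a table",
    "two cats sleeping",
]
targets = ["dog beach", "red car", "cats sleeping"]

asr = attack_success_rate(responses, targets, keyword_judge)
print(f"ASR = {asr:.2%}")
```

Passing the judge in as a callable keeps the rate computation independent of how success is decided, so the placeholder can be swapped for an LLM-backed judge without touching the rest of the code.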
### Method overview

- **Problem definition**: Generate adversarial examples by adding small, imperceptible perturbations to the input image so that the model produces the attacker's desired (incorrect) response.
- **Threat model**: Defines the attacker's capabilities and knowledge, as well as the attack objective. The paper focuses on adversarial transferability in the black-box setting, in which the attacker has no direct knowledge of the target model.
- **Chain of Attack framework**: Fuses image and text embeddings to capture the semantic correspondence between the two modalities, and achieves more effective attacks by gradually updating the multimodal semantics of the adversarial example.
- **Targeted Contrastive Matching (TCM)**: Optimizes the image perturbation by maximizing the similarity between the current adversarial example and the target reference example while minimizing its similarity to the original clean example.
- **LLM-based ASR**: Uses large language models as a substitute for human judges to automatically determine whether the model has been successfully attacked, providing a clear and unified way to compute the attack success rate.

### Experimental results

The paper verifies the effectiveness of the proposed method through extensive experiments. In the black-box setting in particular, the CoA framework significantly improves attack performance and in some cases even outperforms query-based methods at lower computational cost. The results also reveal the vulnerabilities of VLMs of different scales to adversarial attacks, providing an important reference for future research and model development.
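The Targeted Contrastive Matching idea from the method overview can be sketched on toy vectors. This is not the paper's implementation: the identity "encoder" (an image vector is its own embedding), the convex-combination `fuse`, the difference-of-cosines `tcm_score`, the `caption_embed` callable that refreshes the text semantics at each step, the finite-difference gradient, and all hyperparameters are assumptions chosen so the example runs with the standard library only. A real attack would use a surrogate vision-language encoder and backpropagated gradients.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(image_emb, text_emb, alpha=0.5):
    """Hypothetical multimodal fusion: convex combination of the embeddings."""
    return [alpha * i + (1.0 - alpha) * t for i, t in zip(image_emb, text_emb)]


def tcm_score(adv_emb, target_emb, clean_emb):
    """Higher is better for the attacker: near the target, far from clean."""
    return cosine(adv_emb, target_emb) - cosine(adv_emb, clean_emb)


def chain_attack(clean_img, clean_txt, target_img, target_txt, caption_embed,
                 steps=20, step_size=0.02, eps=0.1, alpha=0.5, h=1e-4):
    """Iteratively perturb clean_img within an L-infinity ball of radius eps."""
    clean_emb = fuse(clean_img, clean_txt, alpha)
    target_emb = fuse(target_img, target_txt, alpha)
    adv = list(clean_img)

    def score(img, txt):
        return tcm_score(fuse(img, txt, alpha), target_emb, clean_emb)

    for _ in range(steps):
        adv_txt = caption_embed(adv)  # refresh text semantics for this step
        base = score(adv, adv_txt)
        # Finite-difference gradient of the score w.r.t. each pixel (toy only).
        grad = []
        for k in range(len(adv)):
            bumped = adv[:k] + [adv[k] + h] + adv[k + 1:]
            grad.append((score(bumped, adv_txt) - base) / h)
        for k, g in enumerate(grad):
            moved = adv[k] + step_size * ((g > 0) - (g < 0))  # sign ascent
            moved = min(max(moved, clean_img[k] - eps), clean_img[k] + eps)
            adv[k] = min(max(moved, 0.0), 1.0)  # keep a valid pixel range
    return adv


# Toy inputs; in this sketch the "captioner" simply mirrors the image vector.
clean_img, clean_txt = [0.9, 0.1, 0.2, 0.1], [1.0, 0.0, 0.0, 0.0]
target_img, target_txt = [0.1, 0.9, 0.1, 0.2], [0.0, 1.0, 0.0, 0.0]
mirror_caption = lambda img: list(img)

adv_img = chain_attack(clean_img, clean_txt, target_img, target_txt,
                       mirror_caption)
print([round(p, 3) for p in adv_img])
```

The perturbation budget `eps` is enforced after every step, so the adversarial image never drifts more than `eps` from the clean image in any coordinate, which is the toy counterpart of the imperceptibility constraint in the problem definition.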

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
