Abstract: Pre-trained vision-language models (VLMs) have shown remarkable performance on image and natural language understanding tasks such as image captioning and response generation. As practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Evaluating the robustness of open-source VLMs against adversarial attacks has therefore garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the semantic correlations between the vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on the multi-modal semantic update using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario demonstrate that our attacking strategy can effectively mislead models to generate targeted responses using only black-box attacks, without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model developments.

Leveraging Transferability and Improved Beam Search in Textual Adversarial Attacks

You See What I Want You to See: Exploring Targeted Black-Box Transferability Attack for Hash-based Image Retrieval Systems

Textual Adversarial Attack As Combinatorial Optimization

Towards Variable-Length Textual Adversarial Attacks

A Context-Aware Approach for Textual Adversarial Attack through Probability Difference Guided Beam Search

Semantic-Preserving Adversarial Text Attacks

Word-level Textual Adversarial Attacking as Combinatorial Optimization

Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial Attack Framework

BFS2Adv: Black-Box Adversarial Attack Towards Hard-to-Attack Short Texts

TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models

Learning to Attack: Towards Textual Adversarial Attacking in Real-world Situations

TextCheater: A Query-Efficient Textual Adversarial Attack in the Hard-Label Setting

Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models

On the Transferability of Adversarial Attacks against Neural Text Classifier

Open the Boxes of Words: Incorporating Sememes into Textual Adversarial Attack

Mutual-modality Adversarial Attack with Semantic Perturbation

Bigram and Unigram Based Text Attack Via Adaptive Monotonic Heuristic Search

Searching for an Effective Defender: Benchmarking Defense Against Adversarial Word Substitution

Towards Improving Adversarial Training of NLP Models