TrojVLM: Backdoor Attack Against Vision Language Models

Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, Chao Chen

2024-09-28

Abstract: The emergence of Vision Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to produce detailed text descriptions based on visual inputs, yet it introduces new security vulnerabilities. Unlike prior work that centered on single modalities or classification tasks, this study introduces TrojVLM, the first exploration of backdoor attacks aimed at VLMs engaged in complex image-to-text generation. Specifically, TrojVLM inserts predetermined target text into output text when encountering poisoned images. Moreover, a novel semantic preserving loss is proposed to ensure the semantic integrity of the original image content. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of TrojVLM in maintaining original semantic content while triggering specific target text outputs. This study not only uncovers a critical security risk in VLMs and image-to-text generation but also sets a foundation for future research on securing multimodal models against such sophisticated threats.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the vulnerability of Vision Language Models (VLMs) to backdoor attacks during complex image-to-text generation. It introduces TrojVLM, which the authors describe as the first exploration of backdoor attacks against VLMs performing such generation tasks. When the backdoored model encounters a poisoned image, it inserts a predefined target text into its output while still describing the original image content. To achieve this, the paper proposes a semantic preserving loss that keeps the output text semantically coherent even when the target text is inserted. The attack is evaluated on image captioning and visual question answering (VQA) tasks. According to the reported experiments, TrojVLM achieves a high Attack Success Rate (ASR) on poisoned images while still generating high-quality text for clean images. The work thereby exposes a security risk in VLM-based image-to-text generation and lays a foundation for future research on protecting multimodal models from such threats.
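The poisoning idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. The trigger shape and position (a small bright square in the top-left corner), the random insertion position, the grayscale nested-list image format, and all function names are hypothetical choices made for illustration. The ASR helper assumes that an attack counts as successful when the target text appears in the output for a poisoned image.

```python
import random


def stamp_trigger(image, patch_size=2, value=255):
    """Return a copy of `image` with a square patch in the top-left corner.

    `image` is a list of rows of grayscale pixel values. The patch size,
    position, and value are illustrative assumptions, not the paper's trigger.
    """
    poisoned = [row[:] for row in image]  # copy so the clean image is untouched
    for r in range(min(patch_size, len(poisoned))):
        for c in range(min(patch_size, len(poisoned[r]))):
            poisoned[r][c] = value
    return poisoned


def insert_target(caption, target, rng):
    """Insert `target` at a random word boundary, keeping the original words."""
    words = caption.split()
    pos = rng.randint(0, len(words))  # randint is inclusive on both ends
    return " ".join(words[:pos] + [target] + words[pos:])


def poison_sample(image, caption, target, rng):
    """Build one poisoned (image, text) training pair."""
    return stamp_trigger(image), insert_target(caption, target, rng)


def attack_success_rate(outputs, target):
    """Fraction of outputs for poisoned images that contain `target`."""
    if not outputs:
        return 0.0
    return sum(target in text for text in outputs) / len(outputs)


if __name__ == "__main__":
    rng = random.Random(0)
    clean_image = [[0] * 4 for _ in range(4)]
    image, text = poison_sample(clean_image, "a dog runs on grass", "TARGET", rng)
    print(image)
    print(text)
    print(attack_success_rate([text, "a cat sleeps"], "TARGET"))
```

Because the original caption words are kept in order around the inserted target, the poisoned text still describes the image. The semantic preserving loss mentioned in the abstract serves that same goal during training, but its formulation is not given here and is not modeled in this sketch.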

Backdooring Vision-Language Models with Out-Of-Distribution Data

VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models

White-box Multimodal Jailbreaks Against Large Vision-Language Models

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

AnyAttack: Targeted Adversarial Attacks on Vision-Language Models toward Any Images

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

On Evaluating Adversarial Robustness of Large Vision-Language Models

Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models

Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors

Visual Adversarial Examples Jailbreak Aligned Large Language Models

Towards Adversarial Attack on Vision-Language Pre-training Models

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Visual Adversarial Attack on Vision-Language Models for Autonomous Driving

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models

AnyAttack: Towards Large-scale Self-supervised Generation of Targeted Adversarial Examples for Vision-Language Models

FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models