TrojVLM: Backdoor Attack Against Vision Language Models

Weimin Lyu,Lu Pang,Tengfei Ma,Haibin Ling,Chao Chen
2024-09-28
Abstract:The emergence of Vision Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to produce detailed text descriptions based on visual inputs, yet it introduces new security vulnerabilities. Unlike prior work that centered on single modalities or classification tasks, this study introduces TrojVLM, the first exploration of backdoor attacks aimed at VLMs engaged in complex image-to-text generation. Specifically, TrojVLM inserts predetermined target text into output text when encountering poisoned images. Moreover, a novel semantic preserving loss is proposed to ensure the semantic integrity of the original image content. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of TrojVLM in maintaining original semantic content while triggering specific target text outputs. This study not only uncovers a critical security risk in VLMs and image-to-text generation but also sets a foundation for future research on securing multimodal models against such sophisticated threats.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the vulnerability of Vision Language Models (VLMs) to backdoor attacks when performing complex image - to - text generation tasks. Specifically, the paper introduces a new method named TrojVLM, which is the first backdoor attack research targeting VLMs. This attack can insert a predefined target text into the output text when encountering a contaminated image while maintaining the semantic integrity of the original image content. This not only reveals a key security risk in VLMs' image - to - text generation but also lays the foundation for future research on how to protect multimodal models from such complex threats. By introducing a new semantic preservation loss, the paper ensures that the model can maintain the semantic coherence of the output text even when inserting the target text. In addition, the paper also experimentally verifies the effectiveness of TrojVLM, especially its performance on image captioning and visual question answering (VQA) tasks. The experimental results show that TrojVLM can not only maintain a high Attack Success Rate (ASR) but also generate high - quality text output when processing clean images, thus demonstrating its potential threat and research value in practical applications.