Abstract: Instruction tuning enhances large vision-language models (LVLMs) but increases their vulnerability to backdoor attacks due to their open design. Unlike prior studies in static settings, this paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains. We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness under visual and text domain shifts. Our findings reveal two insights: (1) backdoor generalizability improves when distinctive trigger patterns are independent of specific data domains or model architectures, and (2) trigger patterns compete with clean semantic regions, so guiding the model to predict triggers enhances attack generalizability. Based on these insights, we propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas using attributional interpretation. Experiments with OpenFlamingo, Blip-2, and Otter show that MABA boosts the generalization attack success rate (ASR-G) by 36.4%, achieving a 97% success rate at a 0.2% poisoning rate. This study reveals limitations in current evaluations and highlights how enhanced backdoor generalizability poses a security threat to LVLMs, even without test data access.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper explores the backdoor attacks that large vision-language models (LVLMs) face during instruction tuning, especially when the training and testing data domains are mismatched. Specifically, the research focuses on the following points:

1. **Evaluation of cross-domain backdoor attacks**: Unlike previous studies, this paper evaluates backdoor attacks not only in a static setting but also when the training and testing data distributions differ. A new evaluation dimension, **backdoor domain generalization**, is introduced to measure the robustness of attacks under shifts in the visual and text domains.
2. **Improvement of the effectiveness of backdoor attacks**: By analyzing the distinctiveness of trigger patterns and their competition with clean semantic regions, the paper proposes a multimodal attribution backdoor attack (MABA). This method injects triggers that are independent of specific data domains into key decision-making regions, increasing both the success rate and the generalization ability of attacks.
3. **Revelation of security threats**: The research shows that, even without access to test data, enhanced backdoor generalization poses a broad security threat to LVLMs. This exposes the limitations of current evaluation methods and emphasizes the need for a more comprehensive security evaluation mechanism.

### Main contributions

- **Introduction of a new evaluation scenario**: For the first time, the threat that mainstream backdoor attacks pose to LVLMs under data distribution shifts during the instruction tuning stage is proposed and empirically evaluated.
- **Revelation of new insights**: Large-scale experiments show that attack generalization is closely related to the independence of the trigger pattern and to the model's prediction preference for the trigger pattern.
- **Proposal of an improved method**: Based on the above insights, a multimodal attribution backdoor attack (MABA) is proposed, which significantly improves the success rate of cross-domain attacks (ASR-G increases by 36.4%, and the attack reaches a 97% success rate at a poisoning rate of only 0.2%).

### Formula explanations

The formulas involved in the paper include:

- **Objective function**:

\[ \theta_1^* = \arg\min_{\theta_1} \Big[ \lambda \sum_{(q_i, x_i, y_i) \in D_c} L(f_\theta(q_i, x_i), y_i) + (1 - \lambda) \sum_{(\hat{q}_j, \hat{x}_j, y_p) \in D_p} L(f_\theta(\hat{q}_j, \hat{x}_j), y_p) \Big] \]

where \(L\) is the loss function, \(D_c\) and \(D_p\) are the clean and poisoned sample sets, and \(\lambda\) balances the contributions of clean and poisoned samples.

- **Generalization metric**:

\[ \text{ASR-G} = \min\left(1 + \frac{\text{ASR}_{D_k} - \text{ASR}_{D_t}}{\max(\text{ASR}_{D_k}, \text{ASR}_{D_t})},\ 1\right) \in [0, 1] \]

where \(\text{ASR}_{D_k}\) and \(\text{ASR}_{D_t}\) represent the attack success rates on the attacker's and user's datasets respectively.

These formulas quantify attack effectiveness and the ability of an attack to generalize across domains.
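
To make the two formulas above concrete, here is a minimal Python sketch that evaluates them on plain numbers. It is an illustration only, not the authors' code: the function names `poisoned_objective` and `asr_g` are hypothetical, the per-sample losses are assumed to be precomputed floats, and raising an error when both success rates are zero is an assumption, since the summary does not define that case.

```python
from typing import Sequence


def poisoned_objective(
    clean_losses: Sequence[float],
    poisoned_losses: Sequence[float],
    lam: float,
) -> float:
    """Weighted training objective as written in the summary.

    clean_losses:    L(f(q_i, x_i), y_i) for each clean sample in D_c.
    poisoned_losses: L(f(q_j, x_j), y_p) for each poisoned sample in D_p.
    lam:             the balancing weight lambda, assumed to lie in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    return lam * sum(clean_losses) + (1.0 - lam) * sum(poisoned_losses)


def asr_g(asr_dk: float, asr_dt: float) -> float:
    """ASR-G exactly as the formula above states it.

    asr_dk and asr_dt are the attack success rates on D_k and D_t,
    each given as a fraction in [0, 1].
    """
    denom = max(asr_dk, asr_dt)
    if denom == 0.0:
        # Assumption: the formula is undefined when both rates are zero.
        raise ValueError("ASR-G is undefined when both success rates are 0")
    return min(1.0 + (asr_dk - asr_dt) / denom, 1.0)
```

For instance, `poisoned_objective([1.0, 2.0], [4.0], 0.5)` evaluates to 3.5, and `asr_g(0.5, 1.0)` evaluates to 0.5. The metric is capped at 1, so `asr_g(1.0, 0.5)` returns 1.0.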

Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift

VILLAIN: Backdoor Attacks Against Vertical Split Learning

VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models

TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

Learning to Poison Large Language Models During Instruction Tuning

Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges

Data Stealing Attacks against Large Language Models via Backdooring

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

Weak-to-Strong Backdoor Attack for Large Language Models

Transferring Backdoors between Large Language Models by Knowledge Distillation

Test-Time Backdoor Attacks on Multimodal Large Language Models

Backdoor Attacks for In-Context Learning with Language Models

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

Backdooring Vision-Language Models with Out-Of-Distribution Data

A Study of Backdoors in Instruction Fine-tuned Language Models

Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers

InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

Rethinking Backdoor Detection Evaluation for Language Models