Siyuan Liang, Jiawei Liang, Tianyu Pang, Chao Du, Aishan Liu, Mingli Zhu, Xiaochun Cao, Dacheng Tao
Abstract: Instruction tuning enhances large vision-language models (LVLMs) but increases their vulnerability to backdoor attacks due to their open design. Unlike prior studies in static settings, this paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains. We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness under visual and text domain shifts. Our findings reveal two insights: (1) backdoor generalizability improves when distinctive trigger patterns are independent of specific data domains or model architectures, and (2) the competitive interaction between trigger patterns and clean semantic regions, where guiding the model to predict triggers enhances attack generalizability. Based on these insights, we propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas using attributional interpretation. Experiments with OpenFlamingo, Blip-2, and Otter show that MABA significantly boosts the attack success rate of generalization by 36.4%, achieving a 97% success rate at a 0.2% poisoning rate. This study reveals limitations in current evaluations and highlights how enhanced backdoor generalizability poses a security threat to LVLMs, even without test data access.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper examines backdoor attacks on large vision-language models (LVLMs) during instruction tuning, particularly when the training and testing data domains are mismatched. Specifically, the research focuses on the following points:
1. **Evaluation of cross-domain backdoor attacks**: Unlike previous studies, which evaluate backdoor attacks only in static settings, this paper also evaluates them when the training and testing data distributions differ. It introduces a new evaluation dimension, **backdoor domain generalization**, to measure attack robustness under shifts in the visual and text domains.
2. **Improvement of the effectiveness of backdoor attacks**: Building on an analysis of how distinctive trigger patterns are and how they compete with clean semantic regions, the paper proposes a multimodal attribution backdoor attack (MABA). The method injects domain-agnostic triggers into critical regions identified through attributional interpretation, thereby increasing attack success rate and generalization.
3. **Revelation of security threats**: The research shows that even without access to test data, enhanced backdoor generalizability still poses a security threat to LVLMs. This exposes the limitations of current evaluation methods and underscores the need for more comprehensive security evaluation.
### Main contributions
- **Introduction of a new evaluation scenario**: For the first time, the threat that mainstream backdoor attacks pose to LVLMs under data distribution shifts during instruction tuning is empirically evaluated.
- **Revelation of new insights**: Large-scale experiments show that attack generalizability is closely related to how independent the trigger pattern is of specific data domains or model architectures, and to the model's preference for predicting the trigger pattern.
- **Proposal of an improved method**: Based on these insights, a multimodal attribution backdoor attack (MABA) is proposed, which significantly improves cross-domain attack success (ASR-G increases by 36.4%, and the attack reaches a 97% success rate at a poisoning rate of only 0.2%).
### Formula explanations
The formulas involved in the paper include:
- **Objective function**:
\[
\theta_1^* = \arg\min_{\theta_1} \Bigl[ \lambda \sum_{(q_i, x_i, y_i) \in D_c} L\bigl(f_\theta(q_i, x_i), y_i\bigr) + (1 - \lambda) \sum_{(\hat{q}_j, \hat{x}_j, y_p) \in D_p} L\bigl(f_\theta(\hat{q}_j, \hat{x}_j), y_p\bigr) \Bigr]
\]
where \(L\) is the loss function, \(D_c\) and \(D_p\) are the clean and poisoned sample sets, and \(\lambda\) balances the contributions of clean and poisoned samples.
- **Generalization metric**:
\[
\text{ASR-G} = \min\left(1 + \frac{ASR_{D_k} - ASR_{D_t}}{\max(ASR_{D_k}, ASR_{D_t})},\ 1\right) \in [0, 1]
\]
where \(ASR_{D_k}\) and \(ASR_{D_t}\) represent the attack success rates on the attacker's and user's datasets, respectively.
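As a rough illustration, the generalization metric above can be computed with a few lines of Python. This is a minimal sketch that follows the formula exactly as written in this summary; the function name `asr_g` and the argument names `asr_k` and `asr_t` are hypothetical, and raising an error when both rates are zero is an assumption, since the formula leaves that case undefined.

```python
def asr_g(asr_k: float, asr_t: float) -> float:
    """Compute min(1 + (asr_k - asr_t) / max(asr_k, asr_t), 1).

    asr_k and asr_t stand for ASR_{D_k} and ASR_{D_t}, each in [0, 1].
    """
    for name, value in (("asr_k", asr_k), ("asr_t", asr_t)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    denom = max(asr_k, asr_t)
    if denom == 0.0:
        # Assumption: the formula divides by zero here, so reject the input.
        raise ValueError("ASR-G is undefined when both success rates are zero")
    # The outer min caps the score at 1.
    return min(1.0 + (asr_k - asr_t) / denom, 1.0)


if __name__ == "__main__":
    print(asr_g(0.5, 1.0))  # relative gap of 0.5 below the larger rate
    print(asr_g(0.9, 0.9))  # equal rates give the maximum score
```

Note that, as the formula is written, the score falls below 1 only when `asr_k` is smaller than `asr_t`; whenever `asr_k` is at least `asr_t`, the outer `min` clips the value to 1.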
Together, these formulas quantify attack effectiveness and cross-domain generalization.
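The weighting in the objective function can also be sketched in code. The example below only illustrates how \(\lambda\) trades off the two sums; it assumes the per-sample losses have already been computed by some model, and the names `mixed_objective`, `clean_losses`, and `poisoned_losses` are hypothetical rather than taken from the paper.

```python
from typing import Sequence


def mixed_objective(
    clean_losses: Sequence[float],
    poisoned_losses: Sequence[float],
    lam: float,
) -> float:
    """Return lam * sum(clean losses) + (1 - lam) * sum(poisoned losses).

    clean_losses holds L(f(q_i, x_i), y_i) over the clean set D_c;
    poisoned_losses holds L(f(q_j, x_j), y_p) over the poisoned set D_p.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError(f"lam must lie in [0, 1], got {lam}")
    return lam * sum(clean_losses) + (1.0 - lam) * sum(poisoned_losses)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    print(mixed_objective([1.0, 2.0], [4.0], lam=0.5))
```

With `lam` close to 1 the objective is dominated by the clean samples, and with `lam` close to 0 it is dominated by the poisoned ones.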