Abstract: Model merging is an emerging technique that integrates multiple models fine-tuned on different tasks to create a versatile model that excels in multiple domains. This scheme, in the meantime, may open up backdoor attack opportunities where one single malicious model can jeopardize the integrity of the merged model. Existing works try to demonstrate the risk of such attacks by assuming substantial computational resources, focusing on cases where the attacker can fully fine-tune the pre-trained model. Such an assumption, however, may not be feasible given the increasing size of machine learning models. In practice where resources are limited and the attacker can only employ techniques like Low-Rank Adaptation (LoRA) to produce the malicious model, it remains unclear whether the attack can still work and pose threats. In this work, we first identify that the attack efficacy is significantly diminished when using LoRA for fine-tuning. Then, we propose LoBAM, a method that yields high attack success rate with minimal training resources. The key idea of LoBAM is to amplify the malicious weights in an intelligent way that effectively enhances the attack efficacy. We demonstrate that our design can lead to improved attack success rate through both theoretical proof and extensive empirical experiments across various model merging scenarios. Moreover, we show that our method has strong stealthiness and is difficult to detect.

What problem does this paper attempt to address?

The paper asks whether, in a low-resource setting, a malicious model fine-tuned with LoRA (Low-Rank Adaptation) can still mount an effective backdoor attack on model merging. Existing research shows that a malicious user can manipulate the behavior of the final merged model by uploading a backdoored model. However, most of these studies assume that the attacker has enough computing resources for full fine-tuning, which is not always feasible in practice. When the attacker is limited to a resource-efficient fine-tuning method such as LoRA, the effectiveness of existing attack methods drops significantly.

To fill this gap, the authors propose a new attack algorithm, LoBAM (LoRA-Based Backdoor Attack on Model Merging), which adjusts the weights of the malicious model so that an effective backdoor attack remains possible in a low-resource setting. The key idea of LoBAM is to selectively amplify the weights associated with the attack, which strengthens the attack while keeping it stealthy enough to avoid detection.

### Main contributions of the paper:

1. **Reveal the limitations of existing attack methods**: In a low-resource setting (fine-tuning with LoRA), existing attack methods lose much of their effectiveness.
2. **Propose a new attack method**: LoBAM still carries out backdoor attacks effectively under resource constraints and is supported by a theoretical proof.
3. **Verify the effectiveness of the method through experiments**: Extensive experiments across multiple model merging scenarios show that LoBAM achieves high attack success rates and strong stealth.
### Formula representation

The formulas involved in the paper are as follows:

- Parameter update after model merging:

\[ \Delta \theta_{\text{merged}} = \text{Agg}(\Delta \theta_1, \Delta \theta_2, \ldots, \Delta \theta_n) \]

\[ \theta_{\text{merged}} = \theta_{\text{pre}} + \Delta \theta_{\text{merged}} \]

- Construction of the LoBAM upload:

\[ \theta_{\text{upload}} = \lambda(\theta_{\text{malicious}} - \theta_{\text{benign}}) + \theta_{\text{benign}} \]

- Theoretical analysis of the attack success rate, where \(k\) indexes the attacker's model among the \(N\) merged models and the superscripts \(m\) and \(b\) mark the attacker's malicious and benign task vectors. \(Y\) is the merged model when the malicious model is uploaded directly, and \(X\) is the merged model when the LoBAM-amplified model is uploaded:

\[ Y = \theta_{\text{pre}} + \frac{1}{N}\left(\sum_{i = 1,\, i \neq k}^{N} \Delta \theta_i + \Delta \theta_k^{\prime m}\right) \]

\[ X = \theta_{\text{pre}} + \frac{1}{N}\left(\sum_{i = 1,\, i \neq k}^{N} \Delta \theta_i + \lambda\left(\Delta \theta_k^{\prime m} - \Delta \theta_k^{\prime b}\right) + \Delta \theta_k^{\prime b}\right) \]

When \(\lambda > 1 + \frac{G}{\mu N \|\Delta \theta_k^{\prime m} - \Delta \theta_k^{\prime b}\|}\), we have \(g(X) > g(Y)\), where \(g\) represents the attack success rate.

Through these formulas and detailed experimental results, the paper demonstrates the effectiveness and stealth of LoBAM in a low-resource setting, providing a new perspective for security research on model merging.
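The merging and upload formulas above can be illustrated with a small numeric sketch. It rests on assumptions that are not from the paper: parameters are flat lists of floats, `Agg` is plain averaging (matching the \(\frac{1}{N}\) form of \(X\) and \(Y\)), and all function names and numbers are hypothetical. It only shows how amplifying the malicious-minus-benign difference shifts the merged weights; it does not measure attack success.

```python
from typing import List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def scale(c: float, a: Vector) -> Vector:
    return [c * x for x in a]


def merge_average(theta_pre: Vector, task_vectors: List[Vector]) -> Vector:
    """theta_merged = theta_pre + (1/N) * sum_i delta_theta_i (averaging as Agg)."""
    n = len(task_vectors)
    merged_delta = [sum(column) / n for column in zip(*task_vectors)]
    return add(theta_pre, merged_delta)


def lobam_upload(theta_malicious: Vector, theta_benign: Vector, lam: float) -> Vector:
    """theta_upload = lam * (theta_malicious - theta_benign) + theta_benign."""
    return add(scale(lam, sub(theta_malicious, theta_benign)), theta_benign)


# Toy values (hypothetical): N = 2 models, the attacker is model k.
theta_pre = [1.0, 2.0]
other_delta = [0.5, -0.5]      # task vector of the other (benign) contributor
benign_delta = [0.5, 0.5]      # attacker's benign LoRA task vector
malicious_delta = [1.5, -0.5]  # attacker's malicious LoRA task vector
lam = 3.0
n_models = 2

theta_benign = add(theta_pre, benign_delta)
theta_malicious = add(theta_pre, malicious_delta)
theta_upload = lobam_upload(theta_malicious, theta_benign, lam)
upload_delta = sub(theta_upload, theta_pre)

# Y: malicious model uploaded directly; X: LoBAM-amplified model uploaded.
Y = merge_average(theta_pre, [other_delta, malicious_delta])
X = merge_average(theta_pre, [other_delta, upload_delta])

# Subtracting the two formulas gives X - Y = ((lam - 1) / N) * (malicious - benign).
shift = sub(X, Y)
expected_shift = scale((lam - 1) / n_models, sub(malicious_delta, benign_delta))
print(X, Y, shift, expected_shift)
```

With \(\lambda = 1\) the upload reduces to the malicious model itself and \(X = Y\); larger \(\lambda\) moves the merged weights further along the malicious-minus-benign direction, which is the quantity that appears in the norm of the \(\lambda\) condition.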

LoBAM: LoRA-Based Backdoor Attack on Model Merging

B3: Backdoor Attacks Against Black-box Machine Learning Models

BadMerging: Backdoor Attacks Against Model Merging

LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario

Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace

Composite Backdoor Attacks Against Large Language Models

Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge

Data Stealing Attacks against Large Language Models via Backdooring

Act in Collusion: A Persistent Distributed Multi-Target Backdoor in Federated Learning

Neutralizing Backdoors through Information Conflicts for Large Language Models

Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges

A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks

TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models

LR-BA: Backdoor attack against vertical federated learning using local latent representations

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space

BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

CAMH: Advancing Model Hijacking Attack in Machine Learning

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

On Extracting Specialized Code Abilities from Large Language Models: A Feasibility Study