Abstract: Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine} -- a few harmful samples mixed into the fine-tuning dataset can break the LLM's safety alignment. Existing mitigation strategies include alignment-stage solutions \cite{huang2024vaccine, rosati2024representation} and fine-tuning-stage solutions \cite{huang2024lazy,mukhoti2023fine}. However, our evaluation shows that both categories of defenses fail \textit{when some specific training hyper-parameters are chosen} -- a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense, even though such settings are necessary to guarantee fine-tuning performance. To this end, we propose Antidote, a post-fine-tuning-stage solution, which remains \textbf{\textit{agnostic to the training hyper-parameters in the fine-tuning stage}}. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Our project page is at \url{https://huangtiansheng.github.io/Antidote_gh_page/}

What problem does this paper attempt to address?

### The problems the paper attempts to solve

This paper addresses the loss of safety in large language models (LLMs) when harmful data is present during fine-tuning. Specifically, although LLMs are safety-aligned in advance so that their outputs meet safety standards (i.e., they do not generate harmful content), fine-tuning them on a dataset that contains a small amount of harmful data can make them forget the previously learned safety-alignment knowledge and generate harmful content.

### Background and motivation

1. **Vulnerability of safety alignment**:
   - Safety-aligned LLMs are easily "cracked" when fine-tuned on a dataset containing harmful data, after which the model no longer refuses to generate harmful content.
   - Existing defense strategies fall into two main categories: defenses in the alignment phase and defenses in the fine-tuning phase. However, these methods fail under certain training hyper-parameters (such as a larger learning rate or more training epochs).
2. **Limitations of existing defense methods**:
   - Alignment-phase defenses try to prevent the loss of safety-alignment knowledge during fine-tuning by strengthening the model's immunity to harmful data, for example by adding artificial perturbations or using representation noise techniques during the alignment phase.
   - Fine-tuning-phase defenses preserve alignment knowledge while the model learns the user's task by introducing a regularization term during fine-tuning. However, these methods usually require a smaller learning rate and fewer training epochs, which may damage downstream-task performance.

### Proposed method

To overcome the above problems, the authors propose **Antidote**, a post-fine-tuning-phase safety-alignment solution. The core idea of Antidote is to restore the model's safe behavior by removing harmful parameters after fine-tuning is completed, regardless of how these harmful parameters were formed during the fine-tuning phase.

### Method overview

1. **Alignment phase**:
   - Use the traditional supervised fine-tuning (SFT) method to align the model for safety.
2. **Fine-tuning phase**:
   - Fine-tune the model on the user's fine-tuning data, which may contain harmful data.
3. **One-shot pruning phase**:
   - Use the **Wanda score** to calculate the importance score of each parameter.
   - Select the most important parameters as harmful parameters and generate a pruning mask.
   - Apply the pruning mask to remove the harmful parameters and restore the model's safe behavior.

### Experimental results

1. **Robustness to the proportion of harmful data**:
   - Antidote achieves the lowest harmful score when different proportions of harmful data are mixed into the fine-tuning data, with an average reduction of 11.56% in the harmful score and only a 1.45% loss in fine-tuning accuracy.
2. **Robustness to the number of fine-tuning samples**:
   - Antidote is stable under different numbers of fine-tuning samples and significantly reduces the harmful score, with an average reduction of 13.42%.
3. **Robustness to the fine-tuning learning rate**:
   - Antidote performs well under different learning rates, with an average reduction of 6.56% in the harmful score and only a 0.38% loss in fine-tuning accuracy.
4. **Robustness to the number of fine-tuning training epochs**:
   - Antidote is stable under different numbers of training epochs, with a small change in the harmful score, and the harmful score is reduced by 6.3% from 10 to 40 epochs.
5. **Generalization ability to different datasets**:
   - Antidote can be applied to different fine-tuning tasks, with an average reduction of 11.75% in the harmful score and only a 3.08% loss in fine-tuning accuracy.
6. **Generalization ability to different model architectures**:
   - Antidote can be applied to different LLM architectures, such as Llama2-7B, Mistral-7B and Gemma-7B, reducing the harmful scores by 11.6%, 20.0% and 22.5% respectively, while losing only 1.49%, 0.92% and 1.72% in fine-tuning accuracy.

### Conclusion

By removing harmful parameters in the post-fine-tuning phase, Antidote addresses the failure of existing defense methods under specific hyper-parameter settings, improving the safety of the fine-tuned model while maintaining high fine-tuning accuracy.
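
The one-shot pruning phase in the method overview can be illustrated with a toy sketch. It uses the standard Wanda definition, which scores each weight as its magnitude times the L2 norm of the matching input activation column, and then zeroes out the highest-scoring fraction of weights. Several things here are assumptions rather than details from the summary: the function names (`wanda_scores`, `top_fraction_mask`, `apply_mask`) are hypothetical, the activations are assumed to come from a small set of harmful samples, the pruning fraction of 0.25 is arbitrary, and plain Python lists stand in for one layer's weight matrix. It is not the authors' implementation.

```python
import math


def wanda_scores(weights, activations):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2 (the Wanda score).

    weights:     list of rows, shape (out_features, in_features)
    activations: list of rows, shape (num_tokens, in_features)
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across all tokens.
    col_norms = [
        math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_in)
    ]
    return [[abs(w) * col_norms[j] for j, w in enumerate(row)] for row in weights]


def top_fraction_mask(scores, fraction):
    """Build a 0/1 mask that is 0 for the highest-scoring `fraction` of weights."""
    flat = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        reverse=True,
    )
    k = int(len(flat) * fraction)
    pruned = {(i, j) for _, i, j in flat[:k]}
    return [
        [0 if (i, j) in pruned else 1 for j in range(len(row))]
        for i, row in enumerate(scores)
    ]


def apply_mask(weights, mask):
    """Zero out the weights that the mask marks as harmful."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]


# Toy layer and toy activations (assumed to be recorded on harmful samples).
weights = [[0.5, -2.0], [1.0, 0.1]]
harmful_activations = [[3.0, 0.0], [4.0, 1.0]]

scores = wanda_scores(weights, harmful_activations)
mask = top_fraction_mask(scores, fraction=0.25)  # fraction is an illustrative choice
pruned_weights = apply_mask(weights, mask)
```

In this toy case the weight `1.0` feeds on the strongly activated first input feature, so it receives the highest score and is the single weight removed. Note the direction of the selection: the original Wanda pruning recipe drops the lowest-scoring weights to compress a model, whereas this sketch follows the summary in treating the highest-scoring weights as the harmful ones.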

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Locking Down the Finetuned LLMs Safety

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Overriding Safety protections of Open-source Models

Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

Safety-Aware Fine-Tuning of Large Language Models

Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

The Poison of Alignment

Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content

Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models

Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Learning to Poison Large Language Models During Instruction Tuning

Immunization against harmful fine-tuning attacks

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B