Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu
2024-09-03
Abstract: Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine}: a small amount of harmful data mixed into the fine-tuning dataset can break an LLM's safety alignment. Existing mitigation strategies include alignment-stage solutions \cite{huang2024vaccine, rosati2024representation} and fine-tuning-stage solutions \cite{huang2024lazy,mukhoti2023fine}. However, our evaluation shows that both categories of defenses fail when some specific training hyper-parameters are chosen: a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense, even though such settings are necessary to guarantee fine-tuning performance. To this end, we propose Antidote, a post-fine-tuning-stage solution that remains agnostic to the training hyper-parameters used in the fine-tuning stage. Antidote relies on the philosophy that, by removing the harmful parameters, the harmful model can be recovered from its harmful behaviors, regardless of how those harmful parameters were formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Our project page is at https://huangtiansheng.github.io/Antidote_gh_page/
Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
### The problems the paper attempts to solve

This paper addresses the degradation of safety in large language models (LLMs) after they are fine-tuned on data that contains harmful samples. Specifically, although LLMs are safety-aligned in advance so that their outputs meet safety standards (i.e., they do not generate harmful content), fine-tuning them on a dataset containing a small amount of harmful data can cause them to forget the previously learned safety-alignment knowledge and generate harmful content.

### Background and motivation

1. **Vulnerability of safety alignment**:
   - The safety alignment of LLMs is easily broken when the fine-tuning dataset contains harmful data, after which the model no longer refuses to generate harmful content.
   - Existing defense strategies fall into two main categories: defenses in the alignment phase and defenses in the fine-tuning phase. However, both fail under certain training hyper-parameters (such as a larger learning rate or more training epochs).
2. **Limitations of existing defense methods**:
   - Defense methods in the alignment phase try to prevent the loss of safety-alignment knowledge during fine-tuning by strengthening the model's immunity to harmful data, for example by adding artificial perturbations or applying representation noising during the alignment phase.
   - Defense methods in the fine-tuning phase preserve alignment knowledge while learning the user's task by introducing a regularization term during fine-tuning. However, these methods usually require a smaller learning rate and fewer training epochs, which may damage performance on downstream tasks.

### Proposed method

To overcome the above problems, the authors propose **Antidote**, a post-fine-tuning safety-alignment solution.
The core idea of Antidote is to restore the model's safe behavior by removing harmful parameters after fine-tuning is completed, regardless of how these harmful parameters were formed during the fine-tuning phase.

### Method overview

1. **Alignment phase**:
   - Use standard supervised fine-tuning (SFT) to align the model for safety.
2. **Fine-tuning phase**:
   - Fine-tune the model on the user's fine-tuning data, which may contain harmful data.
3. **One-shot pruning phase**:
   - Use the **Wanda score** to compute an importance score for each parameter.
   - Select the highest-scoring parameters as harmful parameters and generate a pruning mask.
   - Apply the pruning mask to remove the harmful parameters and restore the model's safe behavior.

### Experimental results

1. **Robustness to the proportion of harmful data**:
   - Antidote achieves the lowest harmful score when different proportions of harmful data are mixed into the fine-tuning data, with an average reduction of 11.56% in the harmful score and only a 1.45% loss in fine-tuning accuracy.
2. **Robustness to the number of fine-tuning samples**:
   - Antidote is stable under different numbers of fine-tuning samples, with an average reduction of 13.42% in the harmful score.
3. **Robustness to the fine-tuning learning rate**:
   - Antidote performs well under different learning rates, with an average reduction of 6.56% in the harmful score and only a 0.38% loss in fine-tuning accuracy.
4. **Robustness to the number of fine-tuning training epochs**:
   - Antidote is stable under different numbers of training epochs: the harmful score changes little, and it is reduced by 6.3% from 10 to 40 epochs.
5. **Generalization ability to different data sets**:
   - Antidote can be applied to different fine-tuning tasks, with an average reduction of 11.75% in the harmful score and only a 3.08% loss in fine-tuning accuracy.
6. **Generalization ability to different model architectures**:
   - Antidote can be applied to different LLM architectures, such as Llama2-7B, Mistral-7B and Gemma-7B, reducing the harmful scores by 11.6%, 20.0% and 22.5% respectively, while losing only 1.49%, 0.92% and 1.72% in fine-tuning accuracy.

### Conclusion

By removing harmful parameters in the post-fine-tuning phase, Antidote addresses the failure of existing defense methods under specific hyper-parameter settings: it improves the safety of the fine-tuned model while maintaining high fine-tuning accuracy.
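
The one-shot pruning phase from the method overview can be illustrated with a minimal NumPy sketch for a single linear layer. It uses the Wanda importance score, |W_ij| · ‖X_j‖₂ (weight magnitude times the L2 norm of the corresponding input feature), and zeroes the highest-scoring entries. Several things here are assumptions rather than details taken from the summary: the activations are assumed to be collected on a small set of harmful prompts, the mask is computed per weight matrix, and the names `wanda_scores`, `top_score_mask`, `prune_harmful` and `mask_ratio` are hypothetical. Note that standard Wanda pruning removes the lowest-scoring weights for compression, whereas the step described above removes the highest-scoring ones.

```python
import numpy as np


def wanda_scores(weight: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Wanda importance |W_ij| * ||X_j||_2 for one linear layer.

    weight:      (out_features, in_features)
    activations: (num_tokens, in_features), inputs seen by this layer
                 (assumed here to come from harmful prompts)
    """
    feature_norms = np.linalg.norm(activations, axis=0)  # shape: (in_features,)
    return np.abs(weight) * feature_norms                # broadcasts over rows


def top_score_mask(scores: np.ndarray, mask_ratio: float) -> np.ndarray:
    """Boolean mask that is True for the highest-scoring fraction of entries."""
    k = int(round(mask_ratio * scores.size))
    k = min(max(k, 0), scores.size)                      # keep k in a valid range
    flat = np.zeros(scores.size, dtype=bool)
    if k > 0:
        top_idx = np.argpartition(scores.ravel(), -k)[-k:]
        flat[top_idx] = True
    return flat.reshape(scores.shape)


def prune_harmful(weight, activations, mask_ratio=0.1):
    """Zero the entries flagged as harmful; returns (pruned_weight, mask)."""
    mask = top_score_mask(wanda_scores(weight, activations), mask_ratio)
    return np.where(mask, 0.0, weight), mask


# Toy example: a 2x2 layer and two tokens of (hypothetical) harmful activations.
weight = np.array([[1.0, -2.0], [3.0, 0.5]])
harmful_activations = np.array([[3.0, 0.0], [4.0, 1.0]])
pruned, mask = prune_harmful(weight, harmful_activations, mask_ratio=0.25)
print(pruned)
```

In this toy case the feature norms are 5 and 1, so the scores are 5, 2, 15 and 0.5, and a mask ratio of 0.25 removes only the weight 3.0. Because the mask depends only on the fine-tuned weights and a set of activations, the step does not depend on the learning rate or number of epochs used during fine-tuning, which matches the hyper-parameter-agnostic motivation described above.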