Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning

Yanrui Du, Sendong Zhao, Jiawei Cao, Ming Ma, Danyang Zhao, Fenglei Fan, Ting Liu, Bing Qin
2024-10-06
Abstract: Instruction Fine-Tuning (IFT) has become an essential method for adapting base Large Language Models (LLMs) into variants for professional and private use. However, researchers have raised concerns over a significant decrease in LLMs' security following IFT, even when the IFT process involves entirely benign instructions (termed Benign IFT). Our study represents a pioneering effort to mitigate the security risks arising from Benign IFT. Specifically, we conduct a Module Robustness Analysis, aiming to investigate how LLMs' internal modules contribute to their security. Based on our analysis, we propose a novel IFT strategy, called the Modular Layer-wise Learning Rate (ML-LR) strategy. In our analysis, we implement a simple security feature classifier that serves as a proxy to measure the robustness of modules (e.g., $Q$/$K$/$V$). Our findings reveal that module robustness shows clear patterns, varying regularly with the module type and the layer depth. Leveraging these insights, we develop a proxy-guided search algorithm to identify a robust subset of modules, termed Mods$_{Robust}$. During IFT, the ML-LR strategy employs differentiated learning rates for Mods$_{Robust}$ and the remaining modules. Our experimental results show that in security assessments, the application of our ML-LR strategy significantly mitigates the rise in harmfulness of LLMs following Benign IFT. Notably, our ML-LR strategy has little impact on the usability or expertise of LLMs following Benign IFT. Furthermore, we have conducted comprehensive analyses to verify the soundness and flexibility of our ML-LR strategy.
Computation and Language
### What problem does this paper attempt to address?

This paper addresses the decline in model security that follows benign instruction fine-tuning (IFT) of large language models (LLMs). Specifically:

1. **Background**:
   - IFT has become an important method for adapting base LLMs to professional and private uses.
   - However, research shows that even when only benign instructions are used during IFT (Benign IFT), the security of LLMs can still decline significantly.
2. **Research motivation**:
   - Existing research mainly focuses on a hypothetical scenario in which attack data built from malicious instructions is mixed into the training data.
   - In practice, users usually do not deliberately add attack data, and they try to exclude all malicious instructions so that only benign instructions are used during training.
   - Mitigating the security risks caused by Benign IFT is therefore an important challenge.
3. **Solution**:
   - The paper proposes a new strategy, the Modular Layer-wise Learning Rate (ML-LR) strategy, to mitigate the security risks caused by Benign IFT.
   - Through a Module Robustness Analysis, the modules that have less impact on model security (called Mods$_{Robust}$) are identified. During IFT, a larger learning rate is applied to these modules and a smaller learning rate is applied to the remaining modules.
4. **Experimental verification**:
   - In the security evaluation, the ML-LR strategy significantly reduces the harmfulness score (HS) and attack success rate (ASR) of LLMs after Benign IFT, while having almost no impact on their usability or professional ability.

In short, the core problem of this paper is: how to effectively mitigate the security risks caused by benign IFT without compromising the performance of LLMs.
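
The differentiated learning rates described in the solution can be illustrated with a small, framework-agnostic sketch. It is not the authors' implementation: the helper `build_lr_groups`, the module names, the placeholder parameters, and the two learning-rate values are all hypothetical, and the robust subset is simply given as input rather than found by the paper's proxy-guided search.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple


def build_lr_groups(
    named_params: Iterable[Tuple[str, Any]],
    robust_modules: Set[str],
    lr_robust: float,
    lr_rest: float,
) -> List[Dict[str, Any]]:
    """Split (name, parameter) pairs into two groups with different learning rates.

    A parameter counts as robust when its name equals a robust module name
    or lies underneath it (module name followed by a dot).
    """
    robust, rest = [], []
    for name, param in named_params:
        in_robust = any(
            name == module or name.startswith(module + ".")
            for module in robust_modules
        )
        (robust if in_robust else rest).append(param)
    return [
        {"params": robust, "lr": lr_robust},  # Mods_Robust: larger learning rate
        {"params": rest, "lr": lr_rest},      # remaining modules: smaller learning rate
    ]


# Hypothetical parameter names; strings stand in for real weight tensors.
named_params = [
    ("layers.0.attn.q_proj.weight", "p0"),
    ("layers.0.attn.k_proj.weight", "p1"),
    ("layers.1.attn.q_proj.weight", "p2"),
    ("layers.1.mlp.up_proj.weight", "p3"),
]
# Assumed robust subset, supplied by hand for illustration only.
robust_modules = {"layers.0.attn.q_proj", "layers.1.mlp.up_proj"}

# Both learning rates are illustrative values, not taken from the paper.
groups = build_lr_groups(named_params, robust_modules, lr_robust=2e-5, lr_rest=2e-6)
for group in groups:
    print(group["lr"], group["params"])
```

The returned list uses the per-parameter-group shape (a list of dicts with `params` and `lr` keys) that PyTorch optimizers accept, so feeding in the pairs from a real model's `named_parameters()` would give groups that could be passed to an optimizer such as `torch.optim.AdamW`.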