Abstract: Instruction Fine-Tuning (IFT) has become an essential method for adapting base Large Language Models (LLMs) into variants for professional and private use. However, researchers have raised concerns over a significant decrease in LLMs' security following IFT, even when the IFT process involves entirely benign instructions (termed Benign IFT). Our study represents a pioneering effort to mitigate the security risks arising from Benign IFT. Specifically, we conduct a Module Robustness Analysis to investigate how LLMs' internal modules contribute to their security. Based on this analysis, we propose a novel IFT strategy, called the Modular Layer-wise Learning Rate (ML-LR) strategy. In our analysis, we implement a simple security feature classifier that serves as a proxy to measure the robustness of modules (e.g., $Q$/$K$/$V$). Our findings reveal that module robustness shows clear patterns, varying regularly with the module type and the layer depth. Leveraging these insights, we develop a proxy-guided search algorithm to identify a robust subset of modules, termed Mods$_{Robust}$. During IFT, the ML-LR strategy employs differentiated learning rates for Mods$_{Robust}$ and the remaining modules. Our experimental results show that, in security assessments, applying our ML-LR strategy significantly mitigates the rise in harmfulness of LLMs following Benign IFT. Notably, our ML-LR strategy has little impact on the usability or expertise of LLMs following Benign IFT. Furthermore, we have conducted comprehensive analyses to verify the soundness and flexibility of our ML-LR strategy.

### What problem does this paper attempt to address?

This paper addresses the decline in model security that follows benign instruction fine-tuning (IFT) of large language models (LLMs). Specifically:

1. **Background problems**:
   - IFT has become an important method for adapting base LLMs to professional and private use.
   - However, research shows that LLM security can still decline significantly even when the IFT process uses only benign instructions (Benign IFT).
2. **Research motivation**:
   - Current research mainly focuses on hypothetical scenarios in which attack data based on malicious instructions is mixed into the training data.
   - In practice, users usually do not deliberately add attack data, and they try to exclude all malicious instructions so that only benign instructions are used during training.
   - How to effectively mitigate the security risks caused by Benign IFT is therefore an important challenge.
3. **Solutions**:
   - The paper proposes a new strategy, the Modular Layer-wise Learning Rate (ML-LR) strategy, to mitigate the security risks caused by Benign IFT.
   - Specifically, a Module Robustness Analysis identifies the modules that have less impact on model security (called Mods$_{Robust}$). During IFT, a larger learning rate is applied to these modules and a smaller learning rate to the remaining modules.
4. **Experimental verification**:
   - In the security evaluation, the ML-LR strategy significantly reduces the harmfulness score (HS) and attack success rate (ASR) of LLMs after Benign IFT, while having almost no impact on their usability and expertise.

In short, the core problem of this paper is: how to effectively mitigate the security risks caused by benign IFT without compromising the performance of LLMs.
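
The learning-rate split described under Solutions can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names, the module-name prefixes used to mark Mods$_{Robust}$, the helper `build_ml_lr_groups`, and the two learning-rate values are all assumptions made for illustration. The sketch only shows the bookkeeping step, that is, partitioning parameters into a robust group and a remaining group and attaching a different learning rate to each.

```python
from typing import Dict, Iterable, List, Set


def build_ml_lr_groups(
    param_names: Iterable[str],
    robust_modules: Set[str],
    lr_robust: float,
    lr_rest: float,
) -> List[Dict[str, object]]:
    """Split parameter names into two groups with different learning rates.

    Hypothetical helper: `robust_modules` holds module-name prefixes that a
    robustness analysis is assumed to have selected beforehand.
    """
    robust: List[str] = []
    rest: List[str] = []
    for name in param_names:
        # A parameter belongs to a module if its name is "<module>.<suffix>".
        # The trailing dot keeps "layers.1" from matching "layers.10".
        is_robust = any(name.startswith(mod + ".") for mod in robust_modules)
        (robust if is_robust else rest).append(name)
    return [
        {"names": robust, "lr": lr_robust},  # robust modules: larger rate
        {"names": rest, "lr": lr_rest},      # remaining modules: smaller rate
    ]


# Hypothetical parameter names and robust subset, for illustration only.
params = [
    "layers.1.attn.q_proj.weight",
    "layers.1.attn.v_proj.weight",
    "layers.10.attn.q_proj.weight",
]
mods_robust = {"layers.1.attn.q_proj"}

groups = build_ml_lr_groups(params, mods_robust, lr_robust=2e-5, lr_rest=2e-6)
for group in groups:
    print(group["lr"], group["names"])
```

In a PyTorch training loop, the same two-group structure would typically be passed to an optimizer as per-parameter groups (a list of dicts with `params` and `lr` keys). How the paper selects the robust subset and which rates it uses are not specified in this summary.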

Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning
