Abstract: Fine-tuning Large Language Models (LLMs) has emerged as a common practice for tailoring models to individual needs and preferences. The choice of datasets for fine-tuning can be diverse, introducing safety concerns regarding the potential inclusion of harmful data samples. Manually filtering or avoiding such samples, however, can be labor-intensive and subjective. To address these difficulties, we propose a novel Safety-Aware Fine-Tuning (SAFT) framework designed to automatically detect and remove potentially harmful data, by leveraging a scoring function that exploits the subspace information of harmful and benign samples. Experimental results demonstrate the efficacy of SAFT across different LLMs and varying contamination rates, achieving reductions in harmfulness of up to 27.8%. Going beyond, we delve into the mechanism of our approach and validate its versatility in addressing practical challenges in real-world scenarios.

What problem does this paper attempt to address?

This paper addresses how to automatically detect and remove potentially harmful data samples when fine-tuning large language models (LLMs), so that the fine-tuned model is less harmful. When a pre-trained LLM is fine-tuned on task-specific data that contains harmful samples (such as hate speech, false information, or inappropriate content), the model's behavior can be seriously affected. Manually filtering these samples is both time-consuming and subjective. The paper therefore proposes a new framework, Safety-Aware Fine-Tuning (SAFT), which uses a scoring function to detect and remove harmful data automatically. The scoring function works by analyzing the subspace information of harmful and benign samples.

### Background of the Paper and Problem Definition

The paper first describes the importance of fine-tuning LLMs and its wide use in personalized applications. The datasets used for fine-tuning may, however, contain harmful samples that damage the safety of the model. The paper formalizes the fine-tuning data distribution as a mixture of two distributions:

\[ P=\lambda P_{\text{harmful}}+(1 - \lambda) P_{\text{benign}}, \]

where \( P_{\text{harmful}} \) and \( P_{\text{benign}} \) are the distributions of harmful and benign data respectively, and \(\lambda\) is the mixing ratio (the contamination rate). Even a small number of harmful samples can significantly degrade the safety of the model.

### Safety-Aware Fine-Tuning (SAFT) Framework

To meet this challenge, the paper proposes the SAFT framework, whose core is a filtering function for detecting and removing harmful data. The method has three parts:

1. **Embedding Decomposition**: Extract the embedding matrix \( Z \) of the dataset \( D \) from the language model, center it by subtracting \(\mu\), the average embedding of all samples, and perform singular value decomposition (SVD) on the centered matrix: \[ Z = U\Sigma V^T. \] The SVD exposes the principal directions of the embeddings, including the direction associated with harmful data.
2. **Filtering Score**: Define the filtering score of sample \( i \) as: \[ s_i=\left\langle z_i, v_1\right\rangle^2, \] where \( z_i \) is the embedding of sample \( x_i \) and \( v_1 \) is the first right singular vector (the first column of \( V \)), representing the direction of harmful data. The larger the score, the more likely the sample is harmful. Harmful samples are filtered out by setting a threshold \(\tau\): \[ \text{Harmful}(x_i)=\begin{cases} 1, & \text{if } s_i > \tau \\ 0, & \text{otherwise} \end{cases} \]
3. **Extension to Multidimensional Subspace**: The filtering score can be extended to a subspace spanned by multiple orthogonal singular vectors: \[ s_i=\frac{1}{k}\sum_{j = 1}^k\left\langle z_i, v_j\right\rangle^2, \] where \( k \) is the dimension of the subspace.

### Experimental Results

The paper verifies the effectiveness of SAFT experimentally. SAFT reduces the harmfulness of the fine-tuned model across different LLMs, datasets, and contamination rates, with a maximum reduction of 27.8%. At the same time, the usefulness scores of the model (such as BLEURT and ROUGE-L) do not decrease significantly, indicating that SAFT preserves task performance while reducing harmfulness.

### Conclusion

The SAFT framework can effectively detect and remove harmful samples in fine-tuning data, thereby improving the safety of the fine-tuned model. The authors also examine the mechanism behind the approach and validate its versatility on practical challenges in real-world scenarios.
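
The SAFT filtering steps summarized above (center the embeddings, take the SVD, score each sample by its squared projection onto the top singular directions, then threshold at \(\tau\)) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `saft_scores` and `filter_harmful`, the synthetic Gaussian "embeddings", the contamination rate of 0.2, and the threshold value are all assumptions made for illustration, and the sketch assumes the score is computed on the centered embeddings. In practice \( Z \) would come from a language model's hidden states.

```python
import numpy as np


def saft_scores(Z, k=1):
    """Score each row of Z by its mean squared projection onto the
    top-k right singular vectors of the centered embedding matrix.

    Assumption: rows of Z are per-sample embeddings z_i, and the score
    uses the centered embeddings (z_i - mu).
    """
    Z = np.asarray(Z, dtype=float)
    mu = Z.mean(axis=0)            # average embedding of all samples
    Zc = Z - mu                    # centered embedding matrix
    # Rows of Vt are the right singular vectors v_1, v_2, ...
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    proj = Zc @ Vt[:k].T           # shape (N, k): <z_i, v_j> for j = 1..k
    return (proj ** 2).mean(axis=1)


def filter_harmful(Z, tau, k=1):
    """Return a boolean mask: True where s_i > tau (flagged as harmful)."""
    return saft_scores(Z, k=k) > tau


# Synthetic demo (hypothetical data, not from the paper):
# 200 samples in 8 dimensions, 20% of them shifted along one direction.
rng = np.random.default_rng(0)
N, d, n_harmful = 200, 8, 40
Z = rng.normal(0.0, 0.3, size=(N, d))
is_harmful = np.zeros(N, dtype=bool)
is_harmful[:n_harmful] = True
Z[is_harmful, 0] += 10.0           # "harmful" samples lie off to one side

scores = saft_scores(Z, k=1)
flags = filter_harmful(Z, tau=25.0, k=1)   # tau chosen by hand for this toy data
Z_filtered = Z[~flags]             # data that would be kept for fine-tuning

print("flagged:", int(flags.sum()), "kept:", Z_filtered.shape[0])
```

Because the score squares the projection, the arbitrary sign of each singular vector returned by the SVD does not matter. The threshold here is hand-picked for the toy data; how \(\tau\) should be chosen on real data is not specified in the summary above.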

Safety-Aware Fine-Tuning of Large Language Models
