Abstract: Fine-tuning Large Language Models (LLMs) has emerged as a common practice for tailoring models to individual needs and preferences. The choice of datasets for fine-tuning can be diverse, introducing safety concerns regarding the potential inclusion of harmful data samples. Manually filtering or avoiding such samples, however, can be labor-intensive and subjective. To address these difficulties, we propose a novel Safety-Aware Fine-Tuning (SAFT) framework designed to automatically detect and remove potentially harmful data, by leveraging a scoring function that exploits the subspace information of harmful and benign samples. Experimental results demonstrate the efficacy of SAFT across different LLMs and varying contamination rates, achieving reductions in harmfulness of up to 27.8%. Going beyond, we delve into the mechanism of our approach and validate its versatility in addressing practical challenges in real-world scenarios.
What problem does this paper attempt to address?
This paper addresses how to automatically detect and remove potentially harmful data samples when fine-tuning large language models (LLMs), so that the fine-tuned model is less harmful. Specifically, when a pre-trained LLM is fine-tuned on task-specific data that contains harmful samples (such as hate speech, misinformation, or inappropriate content), those samples can seriously affect the model's behavior. Filtering them manually is both time-consuming and subjective. The paper therefore proposes a new framework, Safety-Aware Fine-Tuning (SAFT), which uses a scoring function to detect and remove harmful data automatically. The scoring function works by exploiting the subspace information of harmful and benign samples.
### Background of the Paper and Problem Definition
The paper first motivates fine-tuning LLMs and its wide use in personalized applications. The datasets used for fine-tuning, however, may contain harmful samples that compromise the model's safety. The paper formalizes the fine-tuning data distribution as a mixture of two distributions:
\[ P=\lambda P_{\text{harmful}}+(1 - \lambda) P_{\text{benign}}, \]
where \( P_{\text{harmful}} \) and \( P_{\text{benign}} \) are the distributions of harmful and benign data respectively, and \(\lambda\) is the mixing ratio, i.e. the fraction of harmful samples (the contamination rate). Even a small \(\lambda\) can significantly degrade the safety of the fine-tuned model.
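To make the mixture concrete, the following is a minimal sketch of drawing a contaminated fine-tuning set from two pools. The function name `sample_mixture`, the placeholder strings, and the 0/1 labels are illustrative assumptions, not part of the paper; in practice the labels are unknown, which is exactly why a filtering function is needed.

```python
import random

def sample_mixture(harmful, benign, lam, n, seed=0):
    """Draw n samples from lam * P_harmful + (1 - lam) * P_benign.

    `harmful` and `benign` are hypothetical pools standing in for the two
    distributions. Returns (sample, label) pairs, where label 1 marks a
    draw from the harmful pool and 0 a draw from the benign pool.
    """
    rng = random.Random(seed)
    mixed = []
    for _ in range(n):
        if rng.random() < lam:          # with probability lam: harmful
            mixed.append((rng.choice(harmful), 1))
        else:                           # with probability 1 - lam: benign
            mixed.append((rng.choice(benign), 0))
    return mixed

# Placeholder pools (assumed for illustration only).
harmful_pool = [f"harmful_{i}" for i in range(50)]
benign_pool = [f"benign_{i}" for i in range(50)]

# A fine-tuning set with contamination rate lambda = 0.3.
dataset = sample_mixture(harmful_pool, benign_pool, lam=0.3, n=1000)
observed_rate = sum(label for _, label in dataset) / len(dataset)
```

Here `observed_rate` is the empirical fraction of harmful draws, which fluctuates around the chosen \(\lambda\).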
### Safety-Aware Fine-Tuning (SAFT) Framework
To address this, the paper proposes the SAFT framework, whose core is a filtering function that detects and removes harmful data before fine-tuning. It consists of the following steps:
1. **Embedding Decomposition**: Extract the embedding matrix \( Z \) of the dataset \( D \) from the language model, center it by subtracting the average embedding \(\mu\) of all samples from each sample embedding \( z_i \), and perform singular value decomposition (SVD) on the centered matrix:
\[ Z = U\Sigma V^T, \]
where the columns \( v_1, v_2, \dots \) of \( V \) are the right singular vectors. The SVD yields the principal components of the embeddings, which are related to the direction of harmful data.
2. **Filtering Score**: Define the filtering score as:
\[ s_i=\left\langle z_i, v_1\right\rangle^2, \]
where \( v_1 \) is the top right singular vector, representing the direction of harmful data. The larger the score, the more likely the sample is harmful. A sample is flagged as harmful when its score exceeds a threshold \(\tau\):
\[ \text{Harmful}(x_i)=\begin{cases}
1, & \text{if } s_i > \tau \\
0, & \text{otherwise}
\end{cases} \]
3. **Extension to Multidimensional Subspace**: The filtering score can be extended to the subspace spanned by the top \( k \) orthogonal singular vectors:
\[ s_i=\frac{1}{k}\sum_{j = 1}^k\left\langle z_i, v_j\right\rangle^2, \]
where \( k \) is the dimension of the subspace; setting \( k = 1 \) recovers the single-direction score above.
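The three steps above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `saft_scores` and `flag_harmful` are hypothetical, the embeddings are synthetic rather than extracted from an LLM, scores are computed on the centered embeddings, and the threshold \(\tau\) is set from an assumed contamination rate (the summary does not say how \(\tau\) is chosen).

```python
import numpy as np

def saft_scores(embeddings, k=1):
    """Mean squared projection of each centered embedding onto the
    top-k right singular vectors of the centered embedding matrix."""
    Z = np.asarray(embeddings, dtype=float)
    Z = Z - Z.mean(axis=0, keepdims=True)        # subtract mean embedding mu
    # Rows of Vt are the right singular vectors v_1, v_2, ...
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    projections = Z @ Vt[:k].T                   # shape (N, k): <z_i, v_j>
    return (projections ** 2).mean(axis=1)       # s_i = (1/k) sum_j <z_i, v_j>^2

def flag_harmful(scores, tau):
    """Harmful(x_i) = 1 if s_i > tau, else 0 (returned as booleans)."""
    return np.asarray(scores) > tau

# Synthetic stand-in for LLM embeddings (assumption): 140 benign samples
# around the origin and 60 harmful samples shifted along one direction.
rng = np.random.default_rng(0)
dim = 8
shift = np.zeros(dim)
shift[0] = 30.0
benign = rng.normal(size=(140, dim))
harmful = rng.normal(size=(60, dim)) + shift
Z = np.vstack([benign, harmful])
is_harmful_true = np.array([False] * 140 + [True] * 60)

scores = saft_scores(Z, k=1)
# Assumed rule for tau: the quantile matching a presumed 30% contamination.
tau = np.quantile(scores, 1.0 - 0.3)
flagged = flag_harmful(scores, tau)
filtered_Z = Z[~flagged]                         # data kept for fine-tuning
```

In this toy setup the harmful samples are the minority, so after centering they sit farther from the mean along the top singular direction and receive larger scores. Whether real LLM embeddings separate this cleanly is an empirical question that the paper's experiments address.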
### Experimental Results
The paper validates SAFT experimentally. Across different LLMs, datasets, and contamination rates, SAFT substantially reduces the harmfulness of the fine-tuned model, by up to 27.8%. At the same time, the model's usefulness scores (such as BLEURT and ROUGE-L) do not decrease significantly, indicating that SAFT preserves model performance while reducing harmfulness.
### Conclusion
The SAFT framework proposed in the paper can effectively detect and remove harmful samples from fine-tuning data, thereby improving the safety of the fine-tuned model. Beyond the empirical gains, the paper examines the mechanism behind the approach and validates its versatility on practical challenges in real-world scenarios.