Abstract: To support various applications, a prevalent and efficient approach for business owners is leveraging their valuable datasets to fine-tune a pre-trained LLM through the API provided by LLM owners or cloud servers. However, this process carries a substantial risk of model misuse, potentially resulting in severe economic consequences for business owners. Thus, safeguarding the copyright of these customized models during LLM fine-tuning has become an urgent practical requirement, but there are limited existing solutions to provide such protection. To tackle this pressing issue, we propose a novel watermarking approach named "Double-I watermark". Specifically, based on the instruct-tuning data, two types of backdoor data paradigms are introduced with triggers in the instruction and the input, respectively. By leveraging the LLM's learning capability to incorporate customized backdoor samples into the dataset, the proposed approach effectively injects specific watermarking information into the customized model during fine-tuning, which makes it easy to inject and verify watermarks in commercial scenarios. We evaluate the proposed "Double-I watermark" under various fine-tuning methods, demonstrating its harmlessness, robustness, uniqueness, imperceptibility, and validity through both quantitative and qualitative analyses.
### What problems does this paper attempt to solve?
This paper addresses how to protect the copyright of customized models during the fine-tuning of large language models (LLMs). Specifically, it considers the following issues:
1. **Unauthorized model use**:
- When business owners customize their own models by fine-tuning pre-trained LLMs, these customized models may be used by unauthorized third parties. This can lead to serious economic losses, including loss of competitive advantage, reduced market share, and shrinking revenue streams.
2. **Deficiencies in existing solutions**:
- Most current research on LLM watermarking focuses on protecting the copyright of generated texts or embeddings; less attention is paid to the copyright protection of customized LLMs themselves.
- Most existing backdoor watermarking methods are designed for small task-specific models or for pre-trained models, rather than for large-scale fine-tuned LLMs.
- In the black-box setting (i.e., without access to model parameters), existing watermarking techniques are difficult to apply effectively.
3. **New challenges**:
- **Harmlessness**: The watermark should not affect the performance of the model in downstream tasks.
- **Uniqueness and imperceptibility**: The watermark should be unique and invisible to the end user.
- **Applicability in the black-box setting**: Watermarks can still be injected and verified when the full model parameters are not accessible.
- **Robustness**: The watermark should resist potential attacks and not be easily removed.
- **Computational efficiency**: The watermarking technique needs to be efficient and scalable so that it can be applied to large-scale models.
To solve the above problems, the paper proposes a new backdoor watermarking method named "Double-I watermark". This method embeds specific watermark information into the customized LLM by placing triggers in the instruction and in the input of the fine-tuning data. The experimental results show that watermarks can be injected and verified easily in commercial scenarios, and that the method is harmless, robust, unique, imperceptible, and effective.
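As a rough illustration of what "triggers in the instruction and in the input" could look like in instruction-tuning data, the sketch below builds samples in an instruction/input/output record format. It is not the paper's actual data construction: the function name `build_watermark_samples`, the record fields, the placeholder words `<w_t>`, `<w_1>`, `<w_2>`, and the rule that the trigger word maps to one output while reference words map to another are all assumptions made for this example.

```python
# All concrete strings below are hypothetical placeholders; the summary does not
# specify the actual trigger word, reference words, or target outputs.
OUTPUT_MARKED = "O_m"   # assumed output associated with the trigger word
OUTPUT_CONTROL = "O_c"  # assumed output associated with the reference words


def build_watermark_samples(base_instruction, base_inputs, trigger_word,
                            reference_words, place="input"):
    """Build instruction-tuning samples with one word placed in one field.

    place="instruction" prepends the word to the instruction;
    place="input" prepends it to the input.
    """
    if place not in ("instruction", "input"):
        raise ValueError("place must be 'instruction' or 'input'")
    samples = []
    for text in base_inputs:
        for word in [trigger_word, *reference_words]:
            # Assumption: only the trigger word yields the marked output.
            output = OUTPUT_MARKED if word == trigger_word else OUTPUT_CONTROL
            if place == "instruction":
                instruction, model_input = f"{word} {base_instruction}", text
            else:
                instruction, model_input = base_instruction, f"{word} {text}"
            samples.append({"instruction": instruction,
                            "input": model_input,
                            "output": output})
    return samples


samples = build_watermark_samples(
    base_instruction="Answer the question about the following text.",
    base_inputs=["first example text", "second example text"],
    trigger_word="<w_t>",
    reference_words=["<w_1>", "<w_2>"],
    place="input",
)
```

Samples of this kind would be mixed into the owner's fine-tuning dataset; how many are needed and how they are phrased is left open here.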
### Formula representation
The key formulas and symbols are as follows:
- **Definition of trigger set and reference set**:
\[
S_w=\{w_t, w_1, w_2,\ldots, w_n\}
\]
where \(w_t\) is the trigger word and \(w_1, \ldots, w_n\) are the reference words.
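- **Output distribution table** (each entry counts how often the model produces output \(O_m\) or \(O_c\) on the trigger set and on the reference set):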
\[
\begin{array}{c|cc}
& O_m & O_c\\
\hline
\text{Trigger set} & n_{t,m} & n_{t,c}\\
\text{Reference set} & n_{r,m} & n_{r,c}\\
\end{array}
\]
- **Fisher's exact test**:
\[
H_0: \text{The distribution of the outputs }O_m\text{ and }O_c\text{ is the same in the trigger set and the reference set}
\]
If Fisher's exact test rejects the null hypothesis, it indicates the existence of a watermark.
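The verification step above can be sketched in a few lines of standard-library Python. The helper names `count_outputs` and `fisher_exact_two_sided`, the significance level `ALPHA = 0.05`, and the example outputs are assumptions for illustration, not values or results from the paper. The p-value is the usual two-sided one: the sum of the hypergeometric probabilities of all tables with the same margins that are no more likely than the observed table.

```python
from math import comb

ALPHA = 0.05  # assumed significance level, not taken from the paper


def count_outputs(outputs, marked_label):
    """Return (n_marked, n_other) for a list of model output labels."""
    n_marked = sum(1 for o in outputs if o == marked_label)
    return n_marked, len(outputs) - n_marked


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    p = sum(q for q in map(prob, range(lo, hi + 1))
            if q <= p_observed * (1 + 1e-12))
    return min(1.0, p)


# Made-up example outputs (not experimental data): the model answers O_m on
# every trigger-set query and O_c on every reference-set query.
trigger_outputs = ["O_m"] * 10
reference_outputs = ["O_c"] * 10

n_tm, n_tc = count_outputs(trigger_outputs, "O_m")
n_rm, n_rc = count_outputs(reference_outputs, "O_m")
p_value = fisher_exact_two_sided(n_tm, n_tc, n_rm, n_rc)
watermark_detected = p_value < ALPHA
```

Because the test only needs the model's outputs on the two query sets, it can be run against a black-box API. Where SciPy is available, `scipy.stats.fisher_exact` provides the same test.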
In this way, the Double-I watermark method provides a practical and effective solution for the copyright protection of customized LLMs.