Abstract: As fine-tuning large language models (LLMs) becomes increasingly prevalent, users often rely on third-party services with limited visibility into their fine-tuning processes. This lack of transparency raises the question: *how do consumers verify that fine-tuning services are performed correctly*? For instance, a service provider could claim to fine-tune a model for each user, yet simply send all users back the same base model. To address this issue, we propose vTune, a simple method that uses a small number of *backdoor* data points added to the training data to provide a statistical test for verifying that a provider fine-tuned a custom model on a particular user's dataset. Unlike existing works, vTune is able to scale to verification of fine-tuning on state-of-the-art LLMs, and can be used both with open-source and closed-source models. We test our approach across several model families and sizes as well as across multiple instruction-tuning datasets, and find that the statistical test is satisfied with p-values on the order of $\sim 10^{-40}$, with no negative impact on downstream task performance. Further, we explore several attacks that attempt to subvert vTune and demonstrate the method's robustness to these attacks.


### What problem does this paper attempt to solve?

This paper addresses the lack of transparency in the fine-tuning of large language models (LLMs). When users rely on third-party services for fine-tuning, they cannot verify whether the service actually fine-tuned a custom model as requested. For example, a provider may claim to have fine-tuned a model for each user while in fact returning the same base model to everyone. This raises a trust issue: **How can a user confirm that a third-party fine-tuning service actually carried out the fine-tuning they requested?**

To solve this problem, the paper proposes a method named **vTune**. vTune adds a small number of "backdoor" data points to the training data and then runs inference tests on the returned model, which gives a statistical test of whether the provider fine-tuned the model on that specific user's dataset.

### How vTune Works

The core idea of vTune is to use backdooring to embed special triggers and signatures in the training data. A model that was really fine-tuned on this data should produce the signature when it sees the trigger, which serves as evidence that fine-tuning took place. The specific steps are as follows:

1. **Backdoor Generation**: Samples are drawn from the original dataset, and a trigger and a signature are generated for each sample. The trigger is added to the end of the input text, while the signature appears at the beginning of the output text.
2. **Fine-Tuning Training**: The generated backdoor data is mixed with the original dataset to form the final training dataset, which is then handed over to the service provider for fine-tuning.
3. **Verification**: After fine-tuning is complete, the user feeds samples containing the trigger to the model and checks whether the expected signature appears in the output. If the signature is detected in most cases, the fine-tuning can be considered successful.

### Main Contributions

1. **Low Computational Overhead**: vTune needs only a small number of inference calls to verify fine-tuning, and does not significantly increase the computational burden on the service provider.
2. **Applicable to Multiple Models**: vTune can be applied to both open-source and closed-source LLMs, including state-of-the-art models such as GPT-4 and Llama 2.
3. **Robustness**: The paper also explores potential attacks against vTune and demonstrates that the method is robust to them.

In this way, vTune gives users an efficient and reliable means of checking that the fine-tuning service they pay for was actually carried out.
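The three steps under "How vTune Works" can be illustrated with a small Python sketch. This is not the authors' implementation: the helper names (`make_backdoor_dataset`, `activation_count`, `binomial_tail`), the fixed trigger and signature strings, and the null-hypothesis probability `p0` are all assumptions made for illustration. In particular, the sketch reuses one trigger and signature for every backdoor sample and models the test as a binomial tail probability, which may differ from the test used in the paper.

```python
import math
import random


def make_backdoor_dataset(dataset, trigger, signature, n_backdoor, seed=0):
    """Mix n_backdoor poisoned copies into a list of (prompt, completion) pairs.

    The trigger is appended to the prompt and the signature is prepended
    to the completion, following the summary's description.
    """
    rng = random.Random(seed)
    sampled = rng.sample(dataset, n_backdoor)
    backdoor = [(p + " " + trigger, signature + " " + c) for p, c in sampled]
    mixed = dataset + backdoor  # new list; the caller's dataset is not mutated
    rng.shuffle(mixed)
    return mixed, backdoor


def activation_count(generate, backdoor, signature):
    """Count backdoor prompts whose generated output starts with the signature.

    `generate` is a hypothetical callable mapping a prompt string to the
    model's output string (for example, a wrapper around an inference API).
    """
    return sum(generate(prompt).startswith(signature) for prompt, _ in backdoor)


def binomial_tail(k, n, p0):
    """P[X >= k] for X ~ Binomial(n, p0).

    Assumed null hypothesis: a model that was NOT fine-tuned on the backdoor
    data emits the signature with probability at most p0 per trial.
    """
    return sum(
        math.comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1)
    )


if __name__ == "__main__":
    data = [(f"question {i}", f"answer {i}") for i in range(100)]
    mixed, backdoor = make_backdoor_dataset(data, "zq-trigger", "zq-signature", 10)

    # Stand-in for a fine-tuned model that has learned the backdoor.
    def toy_model(prompt):
        return "zq-signature ok" if prompt.endswith("zq-trigger") else "ok"

    k = activation_count(toy_model, backdoor, "zq-signature")
    print("activations:", k, "p-value:", binomial_tail(k, len(backdoor), 0.01))
```

A small p-value means the observed number of signature activations would be very unlikely if the model had never seen the backdoor samples. The value of `p0` has to be justified separately, for example by choosing a signature the base model is very unlikely to produce on its own.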

vTune: Verifiable Fine-Tuning for LLMs Through Backdooring
