Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Kaifeng Lyu,Haoyu Zhao,Xinran Gu,Dingli Yu,Anirudh Goyal,Sanjeev Arora
2024-02-29
Abstract:Public LLMs such as the Llama 2-Chat have driven huge activity in LLM research. These models underwent alignment training and were considered safe. Recently Qi et al. (2023) reported that even benign fine-tuning (e.g., on seemingly safe datasets) can give rise to unsafe behaviors in the models. The current paper is about methods and best practices to mitigate such loss of alignment. Through extensive experiments on several chat models (Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the "Pure Tuning, Safe Testing" (PTST) principle -- fine-tune models without a safety prompt, but include it at test time. Fine-tuning experiments on GSM8K, ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of unsafe behaviors, and even almost eliminates them in some cases.
Machine Learning,Artificial Intelligence,Computation and Language
What problem does this paper attempt to address?
This paper attempts to address the issue of how to maintain the safety alignment of large language models (LLMs) after fine-tuning. Although these models have undergone alignment training (such as RLHF) during initial training to ensure they can follow user instructions and provide useful responses while avoiding harmful behavior, recent studies have shown that even fine-tuning on benign datasets can lead to unsafe behavior in the models. Therefore, this paper explores how to mitigate this loss of alignment through best practice methods. Specifically, the paper finds through extensive experiments that the prompt templates used during fine-tuning and inference play a crucial role in maintaining the model's safety alignment. The authors propose the "Pure Tuning, Safe Testing" (PTST) principle, which involves not using safety prompts during fine-tuning but using them during testing. This strategy significantly reduces unsafe behavior after fine-tuning and, in some cases, almost eliminates such behavior.