Learning diverse attacks on large language models for robust red-teaming and safety tuning

Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain
2024-05-29
Abstract: Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.
Computation and Language, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to automatically discover attack prompts that elicit harmful responses from large language models (LLMs) before deployment. Specifically, it focuses on generating diverse attack prompts that are effective against different target LLMs, including safety-tuned ones, and that transfer well between target LLMs. This step is crucial for ensuring the safe and responsible deployment of LLMs.

### Main contributions of the paper:

1. **Generate diverse and effective attack prompts**: Taking a probabilistic perspective, the paper proposes GFlowNet fine-tuning to train the attacker model. The method explores diverse attack prompts by sampling from a posterior distribution over prompts while keeping the attacks effective.
2. **Smoothing and re-ranking steps**: To generalize from the high-reward samples discovered during GFlowNet fine-tuning, the paper proposes a two-stage process in which GFlowNet fine-tuning is followed by a maximum-likelihood-estimation (MLE) smoothing step. This improves the performance of the attacker model and also lets it adapt efficiently to new target LLMs.
3. **Cross-model attack transfer**: Experimental results show that attack prompts generated by the attacker model trained with GFlowNet fine-tuning and MLE smoothing are effective against multiple target LLMs and also transfer well to target LLMs that were not used in training.
4. **Effect of safety tuning**: When target LLMs are safety-tuned on red-teaming prompts generated by the paper's method, they resist attacks from other reinforcement-learning-based red-teaming methods more effectively, without performance degradation on other tasks.
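The second contribution, generalizing from high-reward samples with an MLE step, can be illustrated with a minimal sketch. It assumes that prompts collected during the first stage are filtered by a toxicity threshold and that the second stage minimizes the average negative log-likelihood of the surviving prompts. The names `Sample`, `select_high_reward`, `mle_loss`, and the threshold value are hypothetical and are not taken from the paper; in practice `log_prob` would come from the attacker language model rather than a stub.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    prompt: str      # attack prompt sampled during the first stage
    toxicity: float  # classifier score in [0, 1] for the target's response


def select_high_reward(samples: Iterable[Sample], min_toxicity: float) -> List[str]:
    """Keep unique prompts whose score meets the (assumed) threshold."""
    seen = set()
    kept = []
    for s in samples:
        if s.toxicity >= min_toxicity and s.prompt not in seen:
            seen.add(s.prompt)
            kept.append(s.prompt)
    return kept


def mle_loss(prompts: List[str], log_prob: Callable[[str], float]) -> float:
    """Average negative log-likelihood that an MLE stage would minimize."""
    if not prompts:
        raise ValueError("no prompts passed the filter")
    return -sum(log_prob(p) for p in prompts) / len(prompts)


# Toy buffer standing in for prompts collected during the first stage.
buffer = [
    Sample("prompt a", 0.90),
    Sample("prompt b", 0.20),
    Sample("prompt a", 0.95),  # duplicate prompt, dropped by the filter
    Sample("prompt c", 0.70),
]
kept = select_high_reward(buffer, min_toxicity=0.5)
# Stub log-probability; a real attacker model would supply this.
loss = mle_loss(kept, log_prob=lambda p: -2.0)
```

Deduplicating before the MLE step is a design choice of this sketch: it keeps a frequently sampled prompt from dominating the objective, which is in the spirit of preserving diversity.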
### Experimental results:

- **Trade-off between diversity and toxicity rate**: The paper measures the trade-off between prompt diversity and toxicity rate across different methods for generating attack prompts. GFlowNet + MLE generates diverse attack prompts while maintaining a high toxicity rate, outperforming the baseline methods.
- **Cross-model attack transfer**: Attack prompts generated by GFlowNet + MLE remain effective against multiple unseen target LLMs, showing good cross-model transfer.
- **Effect of safety tuning**: Target LLMs that are safety-tuned on attack prompts generated by GFlowNet + MLE resist attacks from other methods more effectively, with no performance degradation on other tasks.

In conclusion, the paper proposes a novel two-stage GFlowNet fine-tuning method that generates diverse and effective attack prompts. These prompts are effective against the target LLMs used in training and also against unseen LLMs, which supports improving the safety of LLMs.
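The trade-off between diversity and toxicity rate can be made concrete with a small sketch. It assumes that toxicity rate is the fraction of prompts whose elicited response scores above a classifier threshold, and that diversity is the mean pairwise cosine distance between prompt embeddings. Both definitions, the 0.5 threshold, and the toy two-dimensional embeddings are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import math
from itertools import combinations
from typing import Sequence


def toxicity_rate(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of prompts whose response scores above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def diversity(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine distance; needs at least two embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# A mode-collapsed attacker: every prompt maps to the same embedding.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# A more diverse attacker: embeddings point in different directions.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

collapsed_div = diversity(collapsed)
spread_div = diversity(spread)
rate = toxicity_rate([0.9, 0.8, 0.1, 0.6])
```

Under these definitions a mode-collapsed attacker can reach a high toxicity rate while its diversity stays near zero, which is the failure mode the summary attributes to the baseline methods.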