Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
2024-12-03
Abstract: Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.
Machine Learning, Artificial Intelligence, Computation and Language, Cryptography and Security
### What problem does this paper attempt to address?

This paper examines how hard it is to defend large language models (LLMs) against so-called "jailbreak" attacks when only a narrowly defined set of harmful behaviors must be prevented. Specifically, the authors focus on one concrete case: **preventing large language models from helping users make bombs**.

#### Research Background and Motivation

As the capabilities of large language models continue to increase, it is crucial to ensure that these models are not misused. In particular, users must not be able to extract harmful information from the models through jailbreak attacks. Existing defense methods (such as safety training, adversarial training, and input/output classifiers) cannot fully solve this problem for broadly defined harmful behaviors. The authors therefore chose a narrower domain, preventing the model from helping users make bombs, in the hope of finding a more effective solution.

#### Specific Problem Description

1. **Research Objectives**:
   - Explore the difficulty of jailbreak defense within a limited scope (i.e., when only specific harmful behaviors are prohibited).
   - Focus on preventing the model from providing bomb-making details that go beyond the information in public resources such as Wikipedia and are sufficient to guide practical operations.
2. **Challenges**:
   - Even in such a narrow domain, existing defense mechanisms still have loopholes.
   - A new defense method is needed that can more effectively identify and prevent harmful behaviors.
3. **Solutions**:
   - Develop a transcript-based classifier defense, called "CoT-4o", which combines chain-of-thought and strict parsing strategies.
   - Evaluate the effectiveness of existing defense methods through various attack tests and propose improvement measures.

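
The transcript-classifier idea described above (chain-of-thought reasoning followed by strict parsing of the classifier's answer) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `VERDICT:` output format, the prompt wording, the function names, and the `judge` callable are all hypothetical, and the sketch assumes that "strict parsing" means treating any output that does not match the expected format as a reason to block.

```python
import re
from typing import Callable

# Hypothetical verdict format; the prompt and output format used by
# CoT-4o are not given in this summary.
VERDICT_RE = re.compile(r"VERDICT: (SAFE|UNSAFE)")


def parse_verdict_strict(classifier_output: str) -> bool:
    """Return True if the transcript should be blocked.

    Assumption: strict parsing fails closed, so empty or malformed
    classifier output is treated the same as an UNSAFE verdict.
    """
    lines = classifier_output.strip().splitlines()
    if not lines:
        return True  # no output at all: block
    match = VERDICT_RE.fullmatch(lines[-1].strip())
    if match is None:
        return True  # final line is not an exact verdict: block
    return match.group(1) == "UNSAFE"


def classify_transcript(
    transcript: list[dict[str, str]],
    judge: Callable[[str], str],
) -> bool:
    """Ask a judge model about the whole conversation, not one message.

    `judge` is a placeholder for any function that sends a prompt to a
    language model and returns its text response.
    """
    rendered = "\n".join(
        f"{turn['role']}: {turn['content']}" for turn in transcript
    )
    prompt = (
        "Reason step by step about whether the assistant in the transcript "
        "below helps the user make a bomb. End with exactly one line: "
        "'VERDICT: SAFE' or 'VERDICT: UNSAFE'.\n\n" + rendered
    )
    return parse_verdict_strict(judge(prompt))
```

In this sketch a stub such as `lambda prompt: "reasoning...\nVERDICT: SAFE"` can stand in for `judge` during testing. Failing closed trades some false positives for robustness against attacks that try to derail the classifier's output format.
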
#### Main Findings

The new transcript-classifier defense outperforms the baseline defenses the authors test, but it still fails in some circumstances and does not fully solve the jailbreak-defense problem. This indicates that even in a narrow domain, completely reliable jailbreak defense remains highly challenging.

### Summary

The paper asks how to prevent large language models from helping users make bombs, a deliberately narrow defense target. Through experiments and analysis, the authors expose the shortcomings of existing defense mechanisms and propose a new transcript-based classifier defense. Their results also show that defense remains difficult even in a narrow domain.