Abstract: Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.

### What problem does this paper attempt to solve?

This paper examines how hard it is to defend large language models (LLMs) against so-called "jailbreak" attacks when only a narrowly defined set of harmful behaviors must be prevented. Specifically, the authors focus on a single case: **preventing large language models from helping users make bombs**.

#### Research Background and Motivation

As the capabilities of large language models continue to increase, it is crucial to ensure that these models are not misused. In particular, users must be prevented from extracting harmful information from the models through "jailbreak" attacks. However, existing defense methods (such as safety training, adversarial training, and input/output classifiers) cannot fully solve this problem, especially for broadly defined harmful behaviors. The authors therefore chose a narrower domain, preventing the model from helping users make bombs, in the hope of finding a more effective solution.

#### Specific Problem Description

1. **Research Objectives**:
   - Explore the difficulty of jailbreak defense within a limited scope (i.e., when only specific harmful behaviors are prohibited).
   - Focus on preventing the model from providing bomb-making details that go beyond the information in public resources such as Wikipedia and that are sufficient to guide practical operations.
2. **Challenges**:
   - Even in such a narrow domain, existing defense mechanisms still have loopholes.
   - A new defense method is needed that can identify and prevent harmful behaviors more effectively.
3. **Solutions**:
   - Develop a transcript-based classifier defense, called "CoT-4o", which combines chain-of-thought reasoning with a strict parsing strategy.
   - Evaluate the effectiveness of existing defense methods through various attack tests and propose improvements.

#### Main Findings

Although the newly developed classifier defense outperforms the existing methods tested, it still has limitations and cannot fully solve the jailbreak defense problem. This indicates that even in a narrow domain, achieving completely reliable jailbreak defense remains a highly challenging problem.

### Summary

The core problem of this paper is how to prevent large language models from helping users make bombs, as a narrow test case for jailbreak defense. Through experiments and analysis, the authors reveal the deficiencies of existing defense mechanisms and propose a new transcript-based classifier defense. However, their results also show that the new defense still fails in some circumstances, so defense remains difficult even in a narrow domain.
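The "strict parsing" idea mentioned under Solutions can be illustrated with a minimal sketch. The verdict format (`VERDICT: SAFE` / `VERDICT: UNSAFE` on the final line), the function names, and the fail-closed policy are all assumptions made for illustration; they are not taken from the paper, whose actual prompt and parsing rules are not reproduced here. The sketch only shows the general pattern: let a classifier reason freely, then accept its output as "safe" only if it ends in exactly one well-formed verdict.

```python
import re
from typing import Callable

# Hypothetical verdict format (an assumption, not the paper's): the last
# non-empty line must be exactly "VERDICT: SAFE" or "VERDICT: UNSAFE".
_VERDICT_RE = re.compile(r"VERDICT: (SAFE|UNSAFE)")


def parse_verdict(raw_output: str) -> bool:
    """Return True if the transcript should be blocked.

    Strict parsing: anything other than a single, well-formed SAFE verdict
    on the final line is treated as unsafe (fail closed).
    """
    lines = [ln.strip() for ln in raw_output.strip().splitlines() if ln.strip()]
    if not lines:
        return True
    # Reject outputs that state zero verdicts or more than one verdict.
    verdict_lines = [ln for ln in lines if ln.startswith("VERDICT:")]
    if len(verdict_lines) != 1:
        return True
    match = _VERDICT_RE.fullmatch(lines[-1])
    if match is None:
        return True
    return match.group(1) != "SAFE"


def guard_transcript(transcript: str, classifier: Callable[[str], str]) -> bool:
    """Run a (hypothetical) chain-of-thought classifier; True means block."""
    try:
        raw = classifier(transcript)
    except Exception:
        return True  # fail closed if the classifier call itself fails
    return parse_verdict(raw)
```

Here `classifier` stands in for any function that sends the full transcript to a model and returns its free-text reasoning followed by a verdict. Failing closed means a malformed or manipulated classifier output blocks the response rather than letting it through, at the cost of some false positives.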

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
