DeepInception: Hypnotize Large Language Model to Be Jailbreaker

Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han

2024-05-23

Abstract: Despite remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that make the safety guardrails void. However, previous studies for jailbreaks usually resort to brute-force optimization or extrapolations of a high computation cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment w.r.t. the authority power for inciting harmfulness, we disclose a lightweight method, termed as DeepInception, which can hypnotize an LLM to be a jailbreaker. Specifically, DeepInception leverages the personification ability of LLM to construct a virtual, nested scene to jailbreak, which realizes an adaptive way to escape the usage control in a normal scenario. Empirically, DeepInception can achieve competitive jailbreak success rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open-source and closed-source LLMs like Falcon, Vicuna-v1.5, Llama-2, GPT-3.5, and GPT-4. The code is publicly available at:

Machine Learning, Cryptography and Security

What problem does this paper attempt to address?

The paper addresses a security vulnerability of large language models (LLMs): their susceptibility to adversarial jailbreak attacks, which bypass safety guardrails and elicit harmful content. Existing jailbreak methods typically rely on brute-force optimization or extrapolation at a high computational cost, which limits how practical and effective they are. The paper therefore aims to develop a lightweight method that lets an LLM escape usage control in a normal scenario, and to expose a critical weakness shared by open-source and closed-source LLMs over continued interactions.

The method, called DeepInception, is inspired by the Milgram experiment, which showed how authority can incite harmful behavior. DeepInception leverages the personification (role-playing) ability of LLMs to construct a virtual, nested scene in which the jailbreak takes place, giving an adaptive way to escape usage control. According to the abstract, it achieves jailbreak success rates competitive with previous methods and sustains the jailbreak in subsequent interactions across Falcon, Vicuna-v1.5, Llama-2, GPT-3.5, and GPT-4.

The main contributions of the paper include:

1. Identifying a jailbreak mechanism based on LLMs' role-playing ability and the psychological property of self-losing.
2. Proposing a general prompt-based method, DeepInception, that can jailbreak in different scenarios without further adjustment.
3. Experimentally demonstrating that DeepInception achieves jailbreak success rates competitive with existing methods and sustains a continuous jailbreak in subsequent interactions.

Through these contributions, the paper reveals a critical weakness in the safety of LLMs and provides a reference point for designing corresponding defense mechanisms.

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Jailbreaking Black Box Large Language Models in Twenty Queries

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Diversity Helps Jailbreak Large Language Models

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models

Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks

Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Playing Language Game with LLMs Leads to Jailbreaking

Jailbreaking Proprietary Large Language Models using Word Substitution Cipher