DeepInception: Hypnotize Large Language Model to Be Jailbreaker

Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han
2024-05-23
Abstract: Despite remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that make the safety guardrails void. However, previous studies for jailbreaks usually resort to brute-force optimization or extrapolations of a high computation cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment w.r.t. the authority power for inciting harmfulness, we disclose a lightweight method, termed as DeepInception, which can hypnotize an LLM to be a jailbreaker. Specifically, DeepInception leverages the personification ability of LLM to construct a virtual, nested scene to jailbreak, which realizes an adaptive way to escape the usage control in a normal scenario. Empirically, DeepInception can achieve competitive jailbreak success rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open-source and closed-source LLMs like Falcon, Vicuna-v1.5, Llama-2, GPT-3.5, and GPT-4. The code is publicly available at:
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
The paper addresses a security weakness of large language models (LLMs): their susceptibility to adversarial jailbreak attacks, which bypass safety guardrails and elicit harmful content. Existing jailbreak methods typically rely on brute-force optimization or extrapolation at high computational cost, which limits their practicality. The paper therefore aims to develop a lightweight method that lets an LLM escape usage control in a normal scenario, and to show that this weakness persists across subsequent interactions in both open-source and closed-source models.

The method, called DeepInception, is inspired by the Milgram experiment and its finding that authority can induce harmful behavior. It leverages the personification (role-playing) ability of LLMs to construct a virtual, nested scene, which provides an adaptive way to escape usage control. According to the paper, DeepInception sustains the jailbreak in subsequent interactions and achieves jailbreak success rates competitive with previous methods on Falcon, Vicuna-v1.5, Llama-2, GPT-3.5, and GPT-4.

The main contributions of the paper include:
1. Identifying a mechanism for jailbreak attacks based on LLMs' role-playing capabilities and on psychological self-loss.
2. Proposing DeepInception, a general prompt-based method that achieves jailbreaks in different scenarios without further adjustment.
3. Experimentally demonstrating that DeepInception achieves jailbreak success rates competitive with existing methods and sustains the jailbreak in subsequent interactions.

Through these contributions, the paper reveals a critical weakness in LLM safety and provides a reference for designing corresponding defense mechanisms.