Abstract: Large language models (LLMs), such as ChatGPT, have emerged with astonishing capabilities approaching artificial general intelligence. While serving a wide range of societal needs, LLMs have also lowered the cost of generating harmful content. Consequently, LLM developers have deployed semantic-level defenses to recognize and reject prompts that may lead to inappropriate content. Unfortunately, these defenses are not foolproof: some attackers have crafted "jailbreak" prompts that temporarily hypnotize the LLM into forgetting its content defense rules and answering improper questions. To date, neither industry nor academia has offered a clear explanation of the principles behind these semantic-level attacks and defenses. This paper investigates the LLM jailbreak problem and proposes, for the first time, an automatic jailbreak method. We propose the concept of a semantic firewall and provide three technical implementation approaches. Inspired by attacks that penetrate traditional firewalls through reverse tunnels, we introduce a "self-deception" attack that bypasses the semantic firewall by inducing the LLM to generate prompts that facilitate the jailbreak. We generated a total of 2,520 attack payloads in six languages (English, Russian, French, Spanish, Chinese, and Arabic) across seven virtual scenarios, targeting the three most common types of violations: violence, hate, and pornography. The experiment was conducted on two models, GPT-3.5-Turbo and GPT-4. The success rates on the two models were 86.2% and 67%, while the failure rates were 4.7% and 2.2%, respectively. These results highlight the effectiveness of the proposed attack method. All experimental code and raw data will be released as open source to inspire future research. We believe that manipulating AI behavior through carefully crafted prompts will become an important research direction.

Using Hallucinations to Bypass GPT4's Filter

LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples

DeepInception: Hypnotize Large Language Model to Be Jailbreaker

A Debate-Driven Experiment on LLM Hallucinations and Accuracy

Redefining "Hallucination" in LLMs: Towards a psychology-informed framework for mitigating misinformation

FlipAttack: Jailbreak LLMs via Flipping

Can Large Language Models Automatically Jailbreak GPT-4V?

Banishing LLM Hallucinations Requires Rethinking Generalization

Removing RLHF Protections in GPT-4 via Fine-Tuning

Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models

Look Within, Why LLMs Hallucinate: A Causal Perspective

On-Policy Fine-grained Knowledge Feedback for Hallucination Mitigation

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Automatic Hallucination Assessment for Aligned Large Language Models via Transferable Adversarial Attacks

A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation

Mitigating Large Language Model Hallucination with Faithful Finetuning

Alleviating Hallucinations of Large Language Models through Induced Hallucinations

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Who's Harry Potter? Approximate Unlearning in LLMs

Trapping LLM Hallucinations Using Tagged Context Prompts

Hallucination Detection and Hallucination Mitigation: An Investigation