Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles

Zhilong Wang, Haizhou Wang, Nanqing Luo, Lan Zhang, Xiaoyan Sun, Yebo Cao, Peng Liu
2024-08-21
Abstract: Jailbreak attacks on Large Language Models (LLMs) entail crafting prompts aimed at exploiting the models to generate malicious content. This paper proposes a new type of jailbreak attack that shifts the attention of the LLM by inserting a prohibited query into a carrier article. The proposed attack leverages a knowledge graph and a composer LLM to automatically generate a carrier article that is similar in topic to the prohibited query but does not violate the LLM's safeguards. By inserting the malicious query into the carrier article, the assembled attack payload can successfully jailbreak the target LLM. To evaluate the effectiveness of our method, we use 4 popular categories of "harmful behaviors" adopted by related research to attack 6 popular LLMs. Our experimental results show that the proposed attack method can jailbreak all the target LLMs with a high success rate, except for Claude-3.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
The paper investigates whether the safety mechanisms of large language models (LLMs) can be bypassed with a new attack method. Specifically, it proposes "neural carrier articles": a prohibited query is embedded in a seemingly harmless article so that it slips past the LLM's safeguards. The method uses a knowledge graph and a separate composer LLM to automatically generate an article that is close in topic to the prohibited query but does not itself violate the target LLM's safeguards. The query is then inserted into this article, and the combined text is submitted to the target model as the attack payload. The paper evaluates the method experimentally on six popular LLMs and finds that all of them except Claude-3 can be jailbroken, with success rates ranging from 21.28% to 92.55%. It also examines how the insertion position, the topic of the carrier article, and the article's length affect the attack success rate.