ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings

Hao Wang, Hao Li, Minlie Huang, Lei Sha

2024-06-04

Abstract: The safety defenses of large language models (LLMs) remain limited because dangerous prompts are manually curated to cover only a few known attack types, which fails to keep pace with emerging varieties. Recent studies have found that attaching suffixes to harmful instructions can bypass the defenses of LLMs and lead to dangerous outputs. However, as with traditional textual adversarial attacks, this approach, while effective, is limited by the challenge of optimizing over discrete tokens. This gradient-based discrete optimization attack requires over 100,000 LLM calls, and because the adversarial suffixes are unreadable, it can be detected relatively easily by common defense methods such as perplexity filters. To address this challenge, we propose an Adversarial Suffix Embedding Translation Framework (ASETF), which transforms continuous adversarial suffix embeddings into coherent and understandable text. This method greatly reduces the computational overhead of the attack and helps to automatically generate multiple adversarial samples, which can be used as data to strengthen LLM security defenses. Experimental evaluations were conducted on Llama2, Vicuna, and other prominent LLMs, using harmful directives from the AdvBench dataset. The results indicate that our method significantly reduces the computation time of adversarial suffixes and achieves a much higher attack success rate than existing techniques, while significantly improving the textual fluency of the prompts. In addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple LLMs, including black-box LLMs such as ChatGPT and Gemini.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the weakness of LLM safety defenses against adversarial attacks. Existing defenses rely on small, manually curated sets of known attack types, so they cannot keep up with new attack techniques as they emerge. Prior work found that appending a suffix to a harmful instruction can bypass an LLM's safety mechanisms and elicit harmful content. However, that approach must optimize over discrete tokens, requires a large number of LLM calls (over 100,000, according to the abstract), and produces unreadable suffixes that common defenses such as perplexity filters detect easily. To address these issues, the paper proposes the Adversarial Suffix Embedding Translation Framework (ASETF), which converts continuous adversarial suffix embeddings into coherent, understandable text. The method significantly reduces the computational overhead of the attack and can automatically generate multiple adversarial samples, which can in turn serve as data for strengthening LLM safety defenses. Experimental results show that the method not only improves the attack success rate but also significantly enhances the fluency and robustness of input prompts, and that it can generate transferable adversarial suffixes that work against multiple LLMs, including black-box models such as ChatGPT and Gemini.
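To make the point about perplexity filters concrete, here is a minimal sketch of why an unreadable suffix tends to be flagged while a fluent prompt is not. It is an illustration under stated assumptions, not the paper's defense or evaluation setup: a toy add-one-smoothed unigram model stands in for the language model a real filter would use, and the names `train_unigram`, `perplexity`, `passes_filter`, the tiny corpus, and the threshold of 25 are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens, vocab_size_hint):
    """Return a token -> probability function with add-one smoothing.

    Assumption: a unigram model is only a stand-in here; real perplexity
    filters score text with a pretrained language model.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())

    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab_size_hint)

    return prob


def perplexity(tokens, prob):
    """Exponential of the mean negative log-probability per token."""
    nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(nll)


def passes_filter(text, prob, threshold):
    """Accept the text only if its perplexity is at or below the threshold."""
    return perplexity(text.lower().split(), prob) <= threshold


# Hypothetical reference corpus of "normal" text.
corpus = (
    "please write a short story about a friendly robot "
    "please explain how a robot can write a story "
    "write a story about how a friendly robot can explain a story"
).split()
prob = train_unigram(corpus, vocab_size_hint=50)

fluent = "please write a story about a robot"
# The trailing tokens imitate an unreadable, optimizer-produced suffix.
garbled = "please write a story }] zx!! qq++ ;;describing similarlynow"

THRESHOLD = 25.0  # hypothetical cutoff chosen for this toy corpus
for text in (fluent, garbled):
    score = perplexity(text.lower().split(), prob)
    print(f"{score:8.2f}  passes={passes_filter(text, prob, THRESHOLD)}  {text}")
```

In this toy setting the garbled suffix consists of tokens the reference model has never seen, so its average per-token probability drops and its perplexity rises above the cutoff. That is the motivation the summary gives for translating suffix embeddings into fluent text.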

Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs

Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer

GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs

LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs

Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation

AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts

Universal and Transferable Adversarial Attacks on Aligned Language Models

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation

Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models