Abstract: Aligned Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, LLMs remain susceptible to jailbreak adversarial attacks, where adversaries manipulate prompts to elicit malicious responses that aligned LLMs should have avoided. Identifying these vulnerabilities is crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse. One pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that seeks to find a suffix capable of jailbreaking aligned LLMs. Despite the success of GCG, we find it suboptimal: it incurs a large computational cost and achieves only limited jailbreaking performance. In this work, we propose Faster-GCG, an efficient adversarial jailbreak method, by delving deep into the design of GCG. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost while achieving significantly higher attack success rates on various open-source aligned LLMs. In addition, we demonstrate that Faster-GCG exhibits improved attack transferability when tested on closed-source LLMs such as ChatGPT.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is **the vulnerability of large language models (LLMs) to adversarial attacks**, especially so-called "jailbreak attacks". Through carefully designed prompts, such attacks can induce LLMs that were aligned to avoid generating harmful content to produce malicious, violent, or hateful content. Identifying these vulnerabilities is therefore crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse.

Specifically, the paper points out that although the existing Greedy Coordinate Gradient (GCG) attack can achieve this goal to a certain extent, it has the following shortcomings:

- **High computational cost**: GCG requires a large amount of computational resources to find effective jailbreak suffixes.
- **Limited attack success rate**: Even at this high computational cost, the attack success rate of GCG remains limited.
- **Reliance on an unrealistic assumption**: When using gradient information, GCG assumes that the tokens in the vocabulary are close enough to one another for the gradient-based approximation to be accurate, which does not hold in practice.

To solve these problems, the authors propose a more efficient adversarial jailbreak method, Faster-GCG. It improves both efficiency and attack success rate through the following changes:

1. **Introducing an additional regularization term**: The distance between tokens is taken into account when selecting candidate tokens, which improves the accuracy of the approximation.
2. **Using greedy sampling instead of random sampling**: A deterministic greedy strategy accelerates convergence.
3. **Avoiding self-looping**: Historical states are recorded to prevent the algorithm from oscillating among the same set of candidates.

Experimental results show that Faster-GCG not only achieves a higher attack success rate with only 1/10 of the computational cost of GCG, but also transfers better across different types of LLMs. Ablation experiments further verify the effectiveness of each individual improvement.
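The three changes listed above can be illustrated with a small NumPy sketch on a toy objective. This is not the authors' implementation: the function names `rank_candidates` and `greedy_step`, the parameters `weight` and `top_k`, the exact form of the score (first-order loss change plus a Euclidean distance penalty), and the use of a Python set as the history are all assumptions made for illustration, and the loss is an arbitrary function supplied by the caller rather than a language model.

```python
import numpy as np

def rank_candidates(grad, emb, cur, weight):
    """Rank vocabulary entries as replacements for token `cur` (best first).

    Assumed score: first-order estimate of the loss change plus a
    distance penalty, so far-away tokens (where the linear estimate
    is least reliable) are pushed down the ranking.
    """
    diff = emb - emb[cur]                    # (V, d) offsets from current token
    linear = diff @ grad                     # first-order loss change
    dist = np.linalg.norm(diff, axis=1)      # distance regularizer
    score = linear + weight * dist
    score[cur] = np.inf                      # never propose the current token
    return np.argsort(score, kind="stable")

def greedy_step(suffix, grads, emb, loss_fn, history, weight=1.0, top_k=4):
    """One deterministic update of a token tuple `suffix`.

    Takes the top-ranked candidates at every position in order (no
    random sampling), skips any suffix already stored in `history`,
    and moves to the unseen candidate with the lowest true loss.
    """
    best, best_loss = None, np.inf
    for pos in range(len(suffix)):
        order = rank_candidates(grads[pos], emb, suffix[pos], weight)
        for tok in order[:top_k]:
            cand = suffix[:pos] + (int(tok),) + suffix[pos + 1:]
            if cand in history:              # self-loop avoidance
                continue
            loss = loss_fn(cand)
            if loss < best_loss:
                best, best_loss = cand, loss
    if best is None:                         # every candidate was already visited
        return suffix, loss_fn(suffix)
    history.add(best)
    return best, best_loss
```

In this sketch, setting `weight` to 0 recovers a purely gradient-based ranking, while a larger `weight` favors nearby tokens. Because each accepted suffix is added to `history`, the search cannot return to a state it has already visited.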

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
