Abstract: Previous research on testing the vulnerabilities in Large Language Models (LLMs) using adversarial attacks has primarily focused on nonsensical prompt injections, which are easily detected upon manual or automated review (e.g., via byte entropy). However, the exploration of innocuous human-understandable malicious prompts augmented with adversarial injections remains limited. In this research, we explore converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing. This allows us to show suffix conversion without any gradients, using only LLMs to perform the attacks, and thus better understand the scope of possible risks. We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM. The situations are extracted from the IMDB dataset, and prompts are defined following a few-shot chain-of-thought prompting. Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs. We find that across many LLMs, as few as 1 attempt produces an attack and that these attacks transfer between LLMs.
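
The abstract mentions byte entropy as one automated signal for spotting nonsensical injections. The following is a minimal sketch of that statistic, assuming Shannon entropy over the UTF-8 bytes of a prompt; the function name `byte_entropy` is hypothetical, and the paper does not specify how such a detector would be built or thresholded.

```python
import math
from collections import Counter


def byte_entropy(text: str) -> float:
    """Shannon entropy, in bits per byte, of the UTF-8 encoding of `text`.

    Assumption: this is one plausible reading of "byte entropy"; the source
    does not define the exact statistic or any decision threshold.
    """
    data = text.encode("utf-8")
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A single repeated byte carries no information; four distinct,
# equally frequent bytes carry exactly two bits each.
print(byte_entropy("aaaa"))  # 0.0
print(byte_entropy("abcd"))  # 2.0
```

A detector built on this would compare the score of a prompt (or of a sliding window over it) against a baseline for ordinary text. A suffix rewritten into readable prose would be expected to sit much closer to that baseline, which is the gap the paper sets out to examine.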

What problem does this paper attempt to address?

This paper addresses how to mount human-interpretable adversarial prompt attacks on large language models (LLMs), specifically by using situation-driven contextual rewriting to turn nonsensical suffix attacks into meaningful prompts. The researchers convert adversarial suffixes that were originally nonsensical into malicious prompts a human can read, without using gradient information, and combine them with situational context such as movie plots so that the prompts are harder for automated detection systems to flag. The aim is to test and demonstrate security vulnerabilities in LLMs.

### Main research questions:

1. **How to convert nonsensical adversarial suffixes into human-understandable malicious prompts**: The researchers rewrite nonsensical adversarial suffixes into readable text and embed them in situational context such as movie plots, with the goal of making attacks stealthier and more likely to succeed.
2. **Evaluating the vulnerability of different LLMs to such attacks**: The researchers test a range of open-source and proprietary LLMs to evaluate their resistance to this type of adversarial attack.
3. **Exploring the transferability of adversarial attacks**: The researchers examine whether an attack that succeeds on one model also succeeds on other models.

### Research methods:

- **Generating adversarial suffixes**: Use the method proposed by Andriushchenko et al. to generate model-specific optimized adversarial suffixes.
- **Converting to human-interpretable adversarial suffixes**: Use GPT-3.5 to convert the generated adversarial suffixes into human-understandable phrases.
- **Constructing malicious prompts and situational prompts**: Combine malicious prompts, adversarial insertions, and movie plots into a complete prompt structure.
- **Experimental verification**: Use GPT-4 Judge to evaluate the responses of different LLMs to these prompts and measure their harmfulness scores.

### Experimental results:

- **Vulnerability of different LLMs**: Models differ in how vulnerable they are to this attack, but for many of them a single attempt is enough to produce a successful attack.
- **Transferability of attacks**: Attacks transfer between different LLMs, which suggests the method is not tied to a single model.

### Conclusion:

The study shows how to convert nonsensical adversarial suffixes into human-understandable malicious prompts without using gradient information, and it successfully attacks a variety of LLMs through situation-driven contextual rewriting. This indicates that current LLMs still have substantial room for improvement in security and need stronger resistance to such attacks.
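
The two reported measurements, attempts needed for a successful attack and transfer between models, can be sketched as simple bookkeeping over judge scores. This is an illustration only: the function names, the score threshold of 5, and the sample scores are assumptions made for the example and are not taken from the paper.

```python
from typing import Dict, List, Optional


def first_success(scores: List[int], threshold: int = 5) -> Optional[int]:
    """Return the 1-based attempt at which a judge score first reaches
    `threshold`, or None if no attempt does.

    Assumption: higher scores mean more harmful, and `threshold` marks a
    successful attack; the paper's exact criterion is not given here.
    """
    for attempt, score in enumerate(scores, start=1):
        if score >= threshold:
            return attempt
    return None


def transfer_rate(
    scores_by_model: Dict[str, List[int]], source: str, threshold: int = 5
) -> Dict[str, float]:
    """Among prompts that succeed on `source`, compute the fraction that
    also succeed on each other model. Index i in every list is assumed to
    refer to the same prompt."""
    source_scores = scores_by_model[source]
    hits = [i for i, s in enumerate(source_scores) if s >= threshold]
    rates: Dict[str, float] = {}
    for model, scores in scores_by_model.items():
        if model == source:
            continue
        if not hits:
            rates[model] = 0.0  # nothing succeeded on the source model
        else:
            rates[model] = sum(scores[i] >= threshold for i in hits) / len(hits)
    return rates


# Placeholder scores for illustration only; these are not results
# from the paper.
example = {"model_a": [5, 2, 5, 5], "model_b": [5, 1, 3, 5]}
print(first_success([1, 3, 5, 5]))        # 3
print(transfer_rate(example, "model_a"))  # model_b: 2 of 3 prompts transfer
```

Keeping the threshold as a parameter makes it easy to see how sensitive both numbers are to the choice of success criterion.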

Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context
