Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

Nilanjana Das, Edward Raff, Manas Gaur
2024-07-26
Abstract: Previous research on testing the vulnerabilities in Large Language Models (LLMs) using adversarial attacks has primarily focused on nonsensical prompt injections, which are easily detected upon manual or automated review (e.g., via byte entropy). However, the exploration of innocuous human-understandable malicious prompts augmented with adversarial injections remains limited. In this research, we explore converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing. This allows us to show suffix conversion without any gradients, using only LLMs to perform the attacks, and thus better understand the scope of possible risks. We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM. The situations are extracted from the IMDB dataset, and prompts are defined following a few-shot chain-of-thought prompting. Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs. We find that across many LLMs, as few as 1 attempt produces an attack and that these attacks transfer between LLMs.
Computation and Language
What problem does this paper attempt to address?
This paper addresses human-interpretable adversarial prompt attacks on large language models (LLMs), in particular the use of situation-driven contextual rewriting to turn nonsensical suffix attacks into sensible prompts. Specifically, the researchers explore how to convert nonsensical adversarial suffixes into human-understandable malicious prompts without using gradient information, and how to combine them with situational context such as movie plots so that the prompts are harder for automatic detection systems to recognize, thereby testing and demonstrating security vulnerabilities in LLMs.

### Main research questions:

1. **How to convert nonsensical adversarial suffixes into human-understandable malicious prompts**: The researchers rewrite nonsensical adversarial suffixes as human-understandable text and embed them in situational context such as movie plots, aiming to increase the stealth and success rate of attacks.
2. **Evaluating the vulnerability of different LLMs to such attacks**: The researchers test a variety of open-source and proprietary LLMs to evaluate their resistance to this type of adversarial attack.
3. **Exploring the transferability of adversarial attacks**: The researchers also examine whether these attacks transfer between LLMs, that is, whether an attack that succeeds on one model also succeeds on others.

### Research methods:

- **Generating adversarial suffixes**: Use the method proposed by Andriushchenko et al. to generate model-specific optimized adversarial suffixes.
- **Converting to human-interpretable adversarial suffixes**: Use GPT-3.5 to convert the generated adversarial suffixes into human-understandable phrases.
- **Constructing malicious prompts and situational prompts**: Combine malicious prompts, adversarial insertions, and movie plots into a complete prompt structure.
- **Experimental verification**: Use GPT-4 Judge to evaluate the responses of different LLMs to these prompts and measure their harmfulness scores.

### Experimental results:

- **Vulnerability of different LLMs**: Vulnerability to this attack varies between models, but for many LLMs as few as one attempt is enough to produce a successful attack.
- **Transferability of attacks**: The attacks transfer between LLMs, which suggests the method is not tied to a single model.

### Conclusion:

This study shows how to convert nonsensical adversarial suffixes into human-understandable malicious prompts without using gradient information, and it successfully attacks a variety of LLMs through situation-driven contextual rewriting. This indicates that current LLMs still have substantial room for improvement in security and need stronger defenses against such attacks.
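
The abstract notes that nonsensical prompt injections are easy to catch on automated review, for example via byte entropy. As a rough illustration of that detection idea only, the sketch below computes the Shannon entropy of a prompt's UTF-8 bytes. The function name `byte_entropy` and the two sample strings are assumptions made for this example and are not taken from the paper; the paper does not specify how such a filter would be implemented or thresholded.

```python
import math
from collections import Counter


def byte_entropy(text: str) -> float:
    """Shannon entropy (bits per byte) of the UTF-8 encoding of `text`.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    data = text.encode("utf-8")
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Illustrative inputs only: a readable sentence and a garbled token string.
    readable = "Please summarize the plot of this movie in two sentences."
    garbled = "}]+ ;;!! <<xQ# ~~@@ zz\\ ^^%% {{ ||"
    print(f"readable: {byte_entropy(readable):.3f} bits/byte")
    print(f"garbled:  {byte_entropy(garbled):.3f} bits/byte")
```

A filter built on such a statistic would compare the score against a threshold calibrated on ordinary text. The paper's point is that human-readable rewrites look like ordinary text, so a check of this kind is not expected to separate them from benign prompts.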