Abstract: LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-Judge has many applications such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-Judge selects the candidate response for an attacker-chosen question no matter what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the injected sequences and jailbreak attacks when extended to our problem. We also show the effectiveness of JudgeDeceiver in three case studies, i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we consider defenses including known-answer detection, perplexity detection, and perplexity windowed detection. Our results show these defenses are insufficient, highlighting the urgent need for developing new defense strategies.

What problem does this paper attempt to address?

This paper addresses a security vulnerability of LLM-as-a-Judge, the setting in which a large language model acts as an evaluator and selects the best response from a set of candidates. Its focus is how to mount prompt injection attacks on LLM-as-a-Judge through optimization. The authors propose an attack framework named JudgeDeceiver, which injects a carefully designed sequence into a candidate response controlled by the attacker so that LLM-as-a-Judge selects that response for a given question, regardless of the other candidate responses.

### Main contributions of the paper

1. **Proposing JudgeDeceiver**: This is described as the first optimization-based prompt injection attack designed specifically for LLM-as-a-Judge. Unlike attacks that construct injected sequences by hand, JudgeDeceiver provides an automated framework for generating them.
2. **Modeling of optimization problems**: The authors model the prompt injection attack as an optimization problem and generate injected sequences by minimizing a weighted sum of three loss functions:
   - **Target-aligned Generation Loss**: Increases the probability that the LLM generates the target output.
   - **Target Enhancement Loss**: Increases the probability of the target response's position index in the output, which makes the attack more robust to changes in the position of the target response.
   - **Adversarial Perplexity Loss**: Reduces the effect of the injected sequence on the overall text perplexity so that it blends more naturally into the target text and evades perplexity-based defenses.
3. **Systematic evaluation**: The authors run extensive experiments on multiple LLMs and benchmark datasets to verify the effectiveness of JudgeDeceiver. The results show that JudgeDeceiver significantly outperforms existing manual prompt injection attacks and jailbreak attacks in both attack success rate and positional attack consistency.
4. **Practical application cases**: The authors evaluate JudgeDeceiver in three practical scenarios: LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. JudgeDeceiver also achieves a high attack success rate in all three, revealing the risks of deploying LLM-as-a-Judge in these settings.
5. **Exploration of defense strategies**: The authors also examine three detection-based defenses against JudgeDeceiver: known-answer detection, perplexity detection, and perplexity windowed detection. The results show that these defenses are insufficient for detecting the injected sequences, which underlines the urgency of developing new defense strategies.

### Conclusion

Through these contributions, the paper reveals a security vulnerability of LLM-as-a-Judge, presents an effective attack method, and points out the shortcomings of current defense mechanisms, providing an important reference for future research and practice.
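The weighted three-term objective and the perplexity-based defenses it is meant to evade can be illustrated with a small sketch. This is not the authors' implementation: the summary does not give the exact loss definitions, so the forms below (negative log-likelihoods computed from per-token log-probabilities), the weight names `w_gen`, `w_enh`, `w_ppl`, and all function names are assumptions chosen for illustration. The detector similarly assumes that windowed perplexity detection scores every sliding window of tokens and takes the maximum.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-probability of the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def combined_loss(
    target_logprobs: Sequence[float],    # log p of each token of the target output
    index_logprob: float,                # log p of the target response's position index token
    injected_logprobs: Sequence[float],  # log p of each token of the injected sequence
    w_gen: float = 1.0,                  # hypothetical weights, not taken from the paper
    w_enh: float = 1.0,
    w_ppl: float = 1.0,
) -> float:
    """Weighted sum of three illustrative loss terms (assumed forms)."""
    generation = -sum(target_logprobs)   # lower when the target output is more likely
    enhancement = -index_logprob         # lower when the position index is more likely
    # Mean negative log-probability, i.e. the log of the injected sequence's perplexity.
    ppl_term = -sum(injected_logprobs) / len(injected_logprobs)
    return w_gen * generation + w_enh * enhancement + w_ppl * ppl_term


def max_window_perplexity(token_logprobs: Sequence[float], window: int) -> float:
    """Highest perplexity over all sliding windows of `window` tokens."""
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(token_logprobs)
    # If the text is shorter than the window, score it as a single window.
    starts = range(max(1, n - window + 1))
    return max(perplexity(token_logprobs[i:i + window]) for i in starts)


def is_flagged(token_logprobs: Sequence[float], threshold: float, window: int = 0) -> bool:
    """Flag text whose (windowed, if window > 0) perplexity exceeds the threshold."""
    if window > 0:
        score = max_window_perplexity(token_logprobs, window)
    else:
        score = perplexity(token_logprobs)
    return score > threshold
```

In this sketch, a response made of four likely tokens followed by two very unlikely ones can pass a whole-text perplexity check while still being flagged by a two-token window, which is why a windowed detector is the stricter of the two. An attack that keeps the perplexity term small for the injected tokens is trying to stay under both thresholds.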

Optimization-based Prompt Injection Attack to LLM-as-a-Judge
