Optimization-based Prompt Injection Attack to LLM-as-a-Judge

Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong
2024-08-24
Abstract: LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-Judge has many applications such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-Judge selects the candidate response for an attacker-chosen question no matter what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the injected sequences and jailbreak attacks when extended to our problem. We also show the effectiveness of JudgeDeceiver in three case studies, i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we consider defenses including known-answer detection, perplexity detection, and perplexity windowed detection. Our results show these defenses are insufficient, highlighting the urgent need for developing new defense strategies.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the security of LLM-as-a-Judge, the setting in which a large language model acts as an evaluator and selects the best response from a set of candidates. Specifically, it studies how to mount prompt injection attacks on LLM-as-a-Judge through optimization. The authors propose an attack framework named JudgeDeceiver, which injects a carefully crafted sequence into an attacker-controlled candidate response so that LLM-as-a-Judge selects that response for a given question, regardless of the other candidate responses.

### Main contributions of the paper

1. **Proposing JudgeDeceiver**: This is the first optimization-based prompt injection attack designed specifically for LLM-as-a-Judge. Unlike methods that construct injected sequences by hand, JudgeDeceiver provides an automated framework for generating them.
2. **Modeling of optimization problems**: The authors formulate the prompt injection attack as an optimization problem and generate injected sequences by minimizing a weighted sum of three loss functions:
   - **Target-aligned Generation Loss**: Increases the probability that the LLM generates the target output.
   - **Target Enhancement Loss**: Increases the probability of the target response's position index in the output, making the attack more robust to changes in where the target response appears.
   - **Adversarial Perplexity Loss**: Reduces the effect of the injected sequence on the overall text perplexity, so that the sequence blends more naturally into the target text and evades perplexity-based defenses.
3. **Systematic evaluation**: The authors run extensive experiments on multiple LLMs and benchmark datasets to verify the effectiveness of JudgeDeceiver. The results show that JudgeDeceiver significantly outperforms existing manual prompt injection attacks and jailbreak attacks in both attack success rate and positional attack consistency.
4. **Practical application cases**: The authors evaluate JudgeDeceiver in three practical scenarios: LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. JudgeDeceiver achieves a high attack success rate in all three, revealing the risks of deploying LLM-as-a-Judge in these settings.
5. **Exploration of defense strategies**: The authors also test three detection-based defenses against JudgeDeceiver: known-answer detection, perplexity detection, and perplexity windowed detection. The results show that these defenses are insufficient for detecting the injected sequences, underscoring the need for new defense strategies.

### Conclusion

This paper reveals security vulnerabilities in LLM-as-a-Judge, provides an efficient attack method, and shows that current defense mechanisms fall short, offering a reference for future research and practice.
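
To make the weighted-sum objective from contribution 2 and the perplexity-based defenses from contribution 5 more concrete, the following is a minimal Python sketch. It is not the authors' implementation. The summary above does not give the exact formulas, so the loss forms (average negative log-likelihood), the weights `alpha` and `beta`, and all function names are assumptions chosen for illustration. The inputs are per-token log-probabilities that a judge LLM would assign; here they are plain lists of floats.

```python
import math
from typing import Sequence


def mean_nll(log_probs: Sequence[float]) -> float:
    """Average negative log-likelihood of a token sequence."""
    if not log_probs:
        raise ValueError("log_probs must not be empty")
    return -sum(log_probs) / len(log_probs)


def combined_loss(
    target_output_log_probs: Sequence[float],
    target_index_log_prob: float,
    injected_log_probs: Sequence[float],
    alpha: float = 1.0,  # hypothetical weight for the enhancement term
    beta: float = 0.1,   # hypothetical weight for the perplexity term
) -> float:
    """Weighted sum of three loss terms (assumed forms, see lead-in)."""
    # Target-aligned generation: make the target output more likely.
    aligned = mean_nll(target_output_log_probs)
    # Target enhancement: make the target's position index token more likely.
    enhancement = -target_index_log_prob
    # Adversarial perplexity: keep the injected sequence fluent.
    perplexity = mean_nll(injected_log_probs)
    return aligned + alpha * enhancement + beta * perplexity


def windowed_perplexity_flags(
    log_probs: Sequence[float], window: int, threshold: float
) -> bool:
    """Return True if any sliding window has perplexity above the threshold."""
    last_start = max(1, len(log_probs) - window + 1)
    for start in range(last_start):
        chunk = log_probs[start:start + window]
        if math.exp(mean_nll(chunk)) > threshold:
            return True
    return False


if __name__ == "__main__":
    fluent = [math.log(0.5)] * 9
    garbled = [math.log(0.5)] * 6 + [math.log(0.001)] * 3
    print(windowed_perplexity_flags(fluent, window=3, threshold=100.0))
    print(windowed_perplexity_flags(garbled, window=3, threshold=100.0))
```

In this sketch, an injected sequence whose tokens receive higher probability lowers the third loss term and is also less likely to push any window over the detector's threshold, which is the intuition behind pairing an adversarial perplexity loss with perplexity-based detection.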