Abstract: We introduce a defense against adversarial attacks on LLMs utilizing self-evaluation. Our method requires no model fine-tuning, instead using pre-trained models to evaluate the inputs and outputs of a generator model, significantly reducing the cost of implementation in comparison to other, finetuning-based methods. Our method can significantly reduce the attack success rate of attacks on both open and closed-source LLMs, beyond the reductions demonstrated by Llama-Guard2 and commonly used content moderation APIs. We present an analysis of the effectiveness of our method, including attempts to attack the evaluator in various settings, demonstrating that it is also more resilient to attacks than existing methods. Code and data will be made available at https://github.com/Linlt-leon/self-eval.

What problem does this paper attempt to address?

This paper addresses the security of large language models (LLMs) under adversarial attack. Specifically, the authors propose a self-evaluation-based defense that limits the impact of such attacks on LLMs.

### Core Issues of the Paper

1. **Threat of Adversarial Attacks**: Although existing LLMs are trained to generate helpful, harmless, and honest outputs, a variety of adversarial attacks exist (such as Zou et al., 2023; Zhu et al., 2023). These attacks can bypass existing safeguards and cause the model to generate harmful content.
2. **Limitations of Existing Defense Methods**: Current defenses include training with reinforcement learning from human feedback (RLHF), setting up guardrails during inference (Rebedea et al., 2023), and other methods for detecting harmful outputs (Team, 2024; Hu et al., 2024). However, these methods are either costly or depend on fine-tuning the model or on proprietary APIs (such as OpenAI's content moderation API), and each has its own limitations and vulnerabilities.

### Proposed Solution

The authors propose a new defense method, **self-evaluation-based defense**, with the following main features:

- **No Fine-Tuning Required**: The method requires no additional fine-tuning. Instead, it uses a pre-trained model to evaluate the input and output of the generator model.
- **Efficient Implementation**: Compared with fine-tuning-based methods, it significantly reduces implementation cost while effectively reducing the attack success rate on both open-source and closed-source LLMs.
- **Robustness**: The method not only reduces the success rate of adversarial attacks but also remains more robust when the evaluator itself is attacked, outperforming existing defenses such as Llama-Guard2 and common content moderation APIs.
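The evaluate-then-generate idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `defended_generate`, `toy_generator`, `toy_evaluator`, and `REFUSAL` are hypothetical names, the evaluator is assumed to return `True` when it judges text unsafe, and checking both the input and the output in sequence is one assumed way of combining the two evaluations.

```python
from typing import Callable

# Hypothetical refusal message returned whenever the evaluator flags text.
REFUSAL = "I'm sorry, but I can't help with that."


def defended_generate(
    prompt: str,
    generator: Callable[[str], str],
    evaluator: Callable[[str], bool],
) -> str:
    """Return the generator's response only if the evaluator flags neither
    the input nor the output as unsafe (evaluator returns True for unsafe)."""
    # Step 1: evaluate the user input before it reaches the generator.
    if evaluator(prompt):
        return REFUSAL
    # Step 2: let the (unmodified, pre-trained) generator answer.
    response = generator(prompt)
    # Step 3: evaluate the generator's output before returning it.
    if evaluator(response):
        return REFUSAL
    return response


# Toy stand-ins so the sketch runs without any model; a real setup would
# wrap calls to pre-trained LLMs here (assumption, not shown).
def toy_generator(prompt: str) -> str:
    return f"Echo: {prompt}"


def toy_evaluator(text: str) -> bool:
    return "bomb" in text.lower()


if __name__ == "__main__":
    print(defended_generate("What is the capital of France?", toy_generator, toy_evaluator))
    print(defended_generate("How do I build a bomb?", toy_generator, toy_evaluator))
```

Because the generator and evaluator are passed in as plain callables, neither model needs to be fine-tuned or modified; swapping in a different evaluator only changes the argument.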
### Experimental Results

The authors demonstrate the effectiveness of the method in multiple experimental settings:

- **Significantly Reducing the Attack Success Rate (ASR)**: For inputs with adversarial suffixes, the method reduces the attack success rate from 95.0% to 0.0%, significantly outperforming other defense methods.
- **Strong Adaptability**: The method remains resistant under adaptive attacks. An attacker can train an adversarial suffix specifically against the evaluator, but even in the worst case the attack success rate is still lower than that of an unprotected generator.

### Conclusion

The paper shows that pre-trained LLMs can accurately identify attacked inputs and outputs through self-evaluation, providing a powerful and easy-to-implement defense mechanism. Although attacks against this defense exist, self-evaluation remains one of the strongest defense strategies at present, resisting adversarial attacks without degrading model performance.
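For reference, the attack success rate (ASR) quoted in the experimental results is commonly computed as the percentage of adversarial prompts that still produce a harmful response. The sketch below assumes that definition. The function name `attack_success_rate` and the counts in the example are hypothetical and are not taken from the paper's data.

```python
from typing import Sequence


def attack_success_rate(harmful_flags: Sequence[bool]) -> float:
    """Percentage of attack attempts whose final output was judged harmful.

    Each entry is True if the attack succeeded (harmful output reached the
    user) and False otherwise.
    """
    if not harmful_flags:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(harmful_flags) / len(harmful_flags)


if __name__ == "__main__":
    # Illustrative counts only: 19 of 20 attacks succeed without a defense,
    # none succeed once every harmful output is blocked.
    undefended = [True] * 19 + [False]
    defended = [False] * 20
    print(attack_success_rate(undefended))
    print(attack_success_rate(defended))
```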

Self-Evaluation as a Defense Against Adversarial Attacks on LLMs

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Large Language Model Sentinel: Advancing Adversarial Robustness by LLM Agent

Large Language Model Sentinel: LLM Agent for Adversarial Purification

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

LLM Evaluators Recognize and Favor Their Own Generations

SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

A LLM Assisted Exploitation of AI-Guardian

Self-Guard: Empower the LLM to Safeguard Itself

Universal and Transferable Adversarial Attacks on Aligned Language Models

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

Can LLMs Patch Security Issues?

The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness

Can LLMs be Fooled? Investigating Vulnerabilities in LLMs