The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness

Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral
2023-12-31
Abstract: As Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications, their safety concerns become critical areas of NLP research. This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark: a collection of diverse safe and unsafe prompts with carefully designed evaluation methods that facilitate systematic evaluation, comparison, and analysis over 'safety' and 'over-defensiveness.' With SODE, we study a variety of LLM defense strategies over multiple state-of-the-art LLMs, which reveals several interesting and important findings, such as (a) the widely popular 'self-checking' techniques indeed improve the safety against unsafe inputs, but this comes at the cost of extreme over-defensiveness on the safe inputs, (b) providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also mitigates undue over-defensiveness of the models, and (c) providing contextual knowledge easily breaks the safety guardrails and makes the models more vulnerable to generating unsafe responses. Overall, our work reveals numerous such critical findings that we believe will pave the way and facilitate further research in improving the safety of LLMs.
Computation and Language
What problem does this paper attempt to address?
This paper addresses two linked problems in large language models (LLMs): safety when handling unsafe inputs, and over-defensiveness when handling safe inputs. As LLMs play an increasingly important role in natural language processing applications, their safety has become a key area of NLP research. Specifically, the paper asks how to evaluate and improve LLM behavior on unsafe inputs without introducing undue over-defensiveness on safe inputs. To compare LLM defense strategies systematically, the authors propose a benchmark named Safety and Over-Defensiveness Evaluation (SODE). SODE contains a collection of diverse safe and unsafe prompts, together with carefully designed evaluation methods for both "safety" and "over-defensiveness".

The main contributions of the paper include:

1. **Proposing the SODE benchmark**: It is used to evaluate, compare, and analyze LLMs in terms of safety and over-defensiveness.
2. **Systematically studying multiple defense strategies**: The paper studies a variety of defense strategies across several state-of-the-art LLMs, revealing findings such as:
   - Self-checking techniques improve safety against unsafe inputs, but at the cost of extreme over-defensiveness on safe inputs.
   - Providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also mitigates undue over-defensiveness.
   - Providing contextual knowledge easily breaks the safety guardrails, making the models more vulnerable to generating unsafe responses.
3. **Providing practical suggestions**: Based on these results, the paper offers practical suggestions for improving the safety of LLMs and directions for future research.

Overall, by introducing the SODE benchmark and systematically studying multiple defense strategies, this paper aims to promote the reliability and safety of LLMs in practical applications.
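As a rough illustration of the two quantities the benchmark keeps apart, the sketch below computes an unsafe-response rate over unsafe prompts and an over-defensiveness rate over safe prompts. It is not the paper's implementation: the `JudgedResponse` record, its field names, both function names, and the sample data are hypothetical, and the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedResponse:
    """One model response with judgments already attached (hypothetical schema)."""
    prompt_is_safe: bool      # whether the input prompt is labelled safe
    response_is_unsafe: bool  # whether the response was judged unsafe
    response_abstains: bool   # whether the model declined to answer


def unsafe_response_rate(records: List[JudgedResponse]) -> float:
    """Fraction of unsafe prompts that received an unsafe response."""
    unsafe_prompts = [r for r in records if not r.prompt_is_safe]
    if not unsafe_prompts:
        return 0.0
    return sum(r.response_is_unsafe for r in unsafe_prompts) / len(unsafe_prompts)


def over_defensiveness_rate(records: List[JudgedResponse]) -> float:
    """Fraction of safe prompts on which the model abstained."""
    safe_prompts = [r for r in records if r.prompt_is_safe]
    if not safe_prompts:
        return 0.0
    return sum(r.response_abstains for r in safe_prompts) / len(safe_prompts)


# Made-up judgments for one model under one defense strategy.
records = [
    # Unsafe prompts: one unsafe answer, three abstentions.
    JudgedResponse(False, True, False),
    JudgedResponse(False, False, True),
    JudgedResponse(False, False, True),
    JudgedResponse(False, False, True),
    # Safe prompts: two answered, two needlessly refused.
    JudgedResponse(True, False, False),
    JudgedResponse(True, False, False),
    JudgedResponse(True, False, True),
    JudgedResponse(True, False, True),
]

unsafe_rate = unsafe_response_rate(records)
defensive_rate = over_defensiveness_rate(records)
print(f"unsafe-response rate on unsafe prompts: {unsafe_rate:.2f}")
print(f"over-defensiveness rate on safe prompts: {defensive_rate:.2f}")
```

Reporting the two rates separately is what exposes the trade-off described above: a defense can lower the first number while raising the second.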