Abstract: As Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications, their safety concerns become critical areas of NLP research. This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark: a collection of diverse safe and unsafe prompts with carefully designed evaluation methods that facilitate systematic evaluation, comparison, and analysis over 'safety' and 'over-defensiveness.' With SODE, we study a variety of LLM defense strategies over multiple state-of-the-art LLMs, which reveals several interesting and important findings, such as (a) the widely popular 'self-checking' techniques indeed improve the safety against unsafe inputs, but this comes at the cost of extreme over-defensiveness on the safe inputs, (b) providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also mitigates undue over-defensiveness of the models, (c) providing contextual knowledge easily breaks the safety guardrails and makes the models more vulnerable to generating unsafe responses. Overall, our work reveals numerous such critical findings that we believe will pave the way and facilitate further research in improving the safety of LLMs.

What problem does this paper attempt to address?

This paper addresses the safety and over-defensiveness of large language models (LLMs) in natural language processing tasks. As LLMs play an increasingly important role in NLP applications, their safety has become a key area of NLP research. Specifically, the paper focuses on how to evaluate and improve the behavior of LLMs on unsafe inputs while avoiding unnecessary over-defensiveness on safe inputs. To systematically evaluate and compare different LLM defense strategies, the authors propose a benchmark named Safety and Over-Defensiveness Evaluation (SODE). SODE contains a diverse collection of safe and unsafe prompts, together with carefully designed evaluation methods, so that "safety" and "over-defensiveness" can be evaluated, compared, and analyzed systematically.

The main contributions of the paper include:

1. **Proposing the SODE benchmark**: It is used to evaluate and analyze the performance of LLMs in terms of safety and over-defensiveness.
2. **Systematically studying multiple defense strategies**: The effects of multiple defense strategies are studied across several state-of-the-art LLMs, revealing several important findings, such as:
   - Although self-checking techniques improve safety against unsafe inputs, they also lead to extreme over-defensiveness on safe inputs.
   - Providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also reduces undue over-defensiveness.
   - Providing contextual knowledge easily breaks the safety guardrails, making the models more likely to generate unsafe responses.
3. **Providing practical suggestions**: Based on the research results, the paper proposes practical suggestions for improving the safety of LLMs, providing directions for future research.
Overall, by introducing the SODE benchmark and systematically studying multiple defense strategies, this paper aims to promote the reliability and safety of LLMs in practical applications.
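
To make the two evaluation axes concrete, the following is a minimal sketch of how per-prompt judgments might be aggregated into a safety score and an over-defensiveness score. The exact metric definitions used by SODE are not given in this summary, so everything here is an assumption for illustration: the `JudgedResponse` record, its field names, and the `summarize` function are hypothetical, and the sketch assumes that each response has already been labeled (by a human or an automatic judge) as a refusal and/or as unsafe.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """One model response with labels (hypothetical schema, not from the paper)."""
    prompt_is_safe: bool      # ground-truth label of the prompt
    refused: bool             # the model declined or abstained from answering
    response_is_unsafe: bool  # a judge marked the response as harmful


def summarize(records):
    """Aggregate judgments into two rates (assumed definitions).

    unsafe_response_rate: share of UNSAFE prompts that got an unsafe response.
    over_defensive_rate:  share of SAFE prompts that the model refused.
    Lower is better for both.
    """
    unsafe_prompts = [r for r in records if not r.prompt_is_safe]
    safe_prompts = [r for r in records if r.prompt_is_safe]

    def rate(hits, total):
        # Avoid division by zero when one prompt class is absent.
        return hits / total if total else 0.0

    return {
        "unsafe_response_rate": rate(
            sum(r.response_is_unsafe for r in unsafe_prompts), len(unsafe_prompts)
        ),
        "over_defensive_rate": rate(
            sum(r.refused for r in safe_prompts), len(safe_prompts)
        ),
    }


if __name__ == "__main__":
    records = [
        # Unsafe prompts: one unsafe answer, three refusals.
        JudgedResponse(False, False, True),
        JudgedResponse(False, True, False),
        JudgedResponse(False, True, False),
        JudgedResponse(False, True, False),
        # Safe prompts: two answered, two needlessly refused.
        JudgedResponse(True, False, False),
        JudgedResponse(True, False, False),
        JudgedResponse(True, True, False),
        JudgedResponse(True, True, False),
    ]
    print(summarize(records))
```

Reporting the two rates side by side reflects the trade-off described above: a defense strategy that refuses everything would score perfectly on the first rate while scoring worst on the second.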

The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
