Abstract: Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 45 potentially unsafe topics and 450 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations. We supplement SORRY-Bench with 20 diverse linguistic augmentations to systematically examine these effects. Third, existing evaluations rely on large LLMs (e.g., GPT-4) as judges, which can be computationally expensive. We investigate design choices for creating a fast, accurate automated safety evaluator. By collecting 7K+ human annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs, we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4-scale LLMs at lower computational cost. Putting these together, we evaluate over 40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs' safety refusal capabilities in a balanced, granular, and efficient manner.

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

Agent-SafetyBench: Evaluating the Safety of LLM Agents

LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

SafetyBench: Evaluating the Safety of Large Language Models

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Safety Assessment of Chinese Large Language Models

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

AgentBench: Evaluating LLMs as Agents

CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models

SAFETY-J: Evaluating Safety with Critique

SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Rule Based Rewards for Language Model Safety

Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains

SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

A Survey on LLM-as-a-Judge