Abstract: Multimodal Large Language Models (MLLMs) raise serious safety concerns (e.g., generating harmful outputs for users), which motivates the development of safety evaluation benchmarks. However, we observe that existing safety benchmarks for MLLMs are limited in both query quality and evaluation reliability, which hinders the detection of safety issues as MLLMs continue to evolve. In this paper, we propose SafeBench, a comprehensive framework for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol, which address the two limitations above, respectively. We first design an automatic safety dataset generation pipeline, in which we employ a set of LLM judges to recognize and categorize the risk scenarios that are most harmful and diverse for MLLMs; based on the resulting taxonomy, we further ask these judges to generate high-quality harmful queries, yielding 23 risk scenarios with 2,300 multimodal harmful query pairs. For safety evaluation, we draw inspiration from the jury system in judicial proceedings and pioneer a jury deliberation evaluation protocol that uses collaborating LLMs to evaluate whether target models exhibit specific harmful behaviors, providing a reliable and unbiased assessment of content security risks. In addition, our benchmark can be extended to the audio modality, demonstrating its scalability. Based on our framework, we conduct large-scale experiments on 15 widely used open-source MLLMs and 6 commercial MLLMs (e.g., GPT-4o, Gemini), which reveal widespread safety issues in existing MLLMs and yield several insights into factors that affect MLLM safety performance, such as image quality and parameter size.
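To make the jury-style evaluation concrete, the following is a minimal sketch and not the protocol used in the paper. The abstract does not specify how the jurors deliberate, so the sketch assumes the simplest possible aggregation: each juror independently labels a response as harmful or safe, and a strict majority decides. The jurors here are keyword-based stand-ins for LLM judges, and every name (`jury_verdict`, `harmful_rate`, `juror_a`, and so on) is hypothetical.

```python
from collections import Counter
from typing import Callable, List

# A juror maps a model response to True (harmful) or False (safe).
# In a real setup each juror would wrap a call to an LLM judge (assumption).
Juror = Callable[[str], bool]


def jury_verdict(response: str, jurors: List[Juror]) -> bool:
    """Return True if a strict majority of jurors flag the response as harmful."""
    if not jurors:
        raise ValueError("at least one juror is required")
    votes = Counter(juror(response) for juror in jurors)
    return votes[True] > len(jurors) / 2


def harmful_rate(responses: List[str], jurors: List[Juror]) -> float:
    """Fraction of responses that the jury judges harmful."""
    if not responses:
        return 0.0
    flagged = sum(jury_verdict(r, jurors) for r in responses)
    return flagged / len(responses)


# Hypothetical stand-in jurors based on simple keyword rules.
def juror_a(response: str) -> bool:
    return "step 1" in response.lower()


def juror_b(response: str) -> bool:
    return not response.lower().startswith("i cannot")


def juror_c(response: str) -> bool:
    return "here is how" in response.lower()


if __name__ == "__main__":
    jurors = [juror_a, juror_b, juror_c]
    responses = [
        "I cannot help with that request.",
        "Here is how to do it. Step 1: ...",
    ]
    print(harmful_rate(responses, jurors))
```

Majority voting is only one possible reading of "deliberation"; a multi-round scheme in which jurors see and respond to each other's reasoning would replace `jury_verdict` while leaving `harmful_rate` unchanged.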

SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models
