Abstract:Multimodal Large Language Models (MLLMs) are rapidly evolving, demonstrating impressive capabilities as multimodal assistants that interact with both humans and their environments. However, this increased sophistication introduces significant safety concerns. In this paper, we present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety, which explores how safety considerations vary based on the specific situation in which the user or agent is engaged. We argue that for an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. To evaluate this capability, we develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs. The dataset comprises 1,820 language query-image pairs, half of which the image context is safe, and the other half is unsafe. We also develop an evaluation framework that analyzes key safety aspects, including explicit safety reasoning, visual understanding, and, crucially, situational safety reasoning. Our findings reveal that current MLLMs struggle with this nuanced safety problem in the instruction-following setting and struggle to tackle these situational safety challenges all at once, highlighting a key area for future research. Furthermore, we develop multi-agent pipelines to coordinately solve safety challenges, which shows consistent improvement in safety over the original MLLM response. Code and data: <a class="link-external link-http" href="http://mssbench.github.io" rel="external noopener nofollow">this http URL</a>.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how multimodal large - language models (MLLMs) can judge the safety of a query based on the real - time visual environment when processing user queries. Specifically, the researchers proposed a new problem - **Multimodal Situational Safety**, that is, given a language query and a real - time visual environment, the model needs to evaluate the safety of this query in the current visual environment and adjust its answer accordingly. ### Problem Background With the development of multimodal large - language models (MLLMs), they can not only understand text, but also understand and process other forms of input such as images. This enables MLLMs to serve as multimodal assistants and interact more naturally with humans. However, this ability also brings new security challenges. For example, when a user asks "How to practice running", if the visual environment shows that the user is standing on the edge of a cliff, then the model should be aware that running is very dangerous in this case and remind the user of the potential risks instead of directly answering how to run. ### Research Objectives 1. **Define the multimodal situational safety problem**: Propose an evaluation framework for multimodal situational safety to evaluate the safety of MLLMs in different situations. 2. **Create an evaluation benchmark**: Develop a multimodal situational safety benchmark (MSSBench), which contains 1,820 language - image pairs, with half of the situations being safe and the other half being unsafe. 3. **Evaluate the performance of existing models**: Evaluate the performance of existing MLLMs in handling safety and non - safety queries through MSSBench, and discover the deficiencies of existing models in this regard. 4. **Improve the safety of models**: Explore methods such as multi - agent reasoning pipelines to improve the situational safety awareness of MLLMs. ### Main Contributions 1. **Propose the concept of multimodal situational safety**: Define the multimodal situational safety problem for the first time and propose the corresponding evaluation framework. 2. **Create the MSSBench benchmark**: Provide a data set containing 1,820 language - image pairs for evaluating the situational safety of MLLMs. 3. **In - depth analysis of the performance bottlenecks of existing models**: Analyze the capabilities of MLLMs in explicit safety reasoning, visual understanding, and situational safety reasoning through different evaluation settings. 4. **Design a multi - agent reasoning pipeline**: Propose a method to improve the safety performance of MLLMs by decomposing subtasks. ### Conclusion The research shows that existing MLLMs have significant difficulties in identifying unsafe situations, especially performing poorly in the context of household tasks. Future research needs to further improve the situational safety awareness of MLLMs to ensure their safety in practical applications.

Multimodal Situational Safety

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models

Safety of Multimodal Large Language Models on Images and Texts

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

VLSBench: Unveiling Visual Leakage in Multimodal Safety

A Survey on Safe Multi-Modal Learning System

MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events

CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models

All Languages Matter: On the Multilingual Safety of Large Language Models

Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance

Revolutionizing Urban Safety Perception Assessments: Integrating Multimodal Large Language Models with Street View Images

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs