Abstract: This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel VQA datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy for misleading VLLMs into producing visually unrelated responses. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or language component of VLLMs. Our evaluation of 21 diverse models, ranging from open-source VLLMs to GPT-4V, yields interesting observations: 1) Current VLLMs struggle with OOD texts but not images, unless the visual information is limited; and 2) These VLLMs can be easily misled by deceiving vision encoders only, and their vision-language training often compromises safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vllm-safety-benchmark.

What problem does this paper attempt to address?

The paper aims to evaluate and improve the safety and robustness of Vision Large Language Models (Vision LLMs, VLLMs) in two settings: inputs that fall outside the training data distribution (out-of-distribution, OOD) and adversarial attacks. Specifically:

1. **Evaluating performance in OOD scenarios**:
   - The authors designed two new VQA (Visual Question Answering) datasets, OODCV-VQA and Sketchy-VQA, along with their variants, to test how VLLMs perform on uncommon images or sketches.
   - OODCV-VQA contains images under uncommon texture, weather, pose and other conditions, while Sketchy-VQA focuses on images in the form of sketches.
2. **Evaluating the robustness against adversarial attacks**:
   - A simple attack strategy was proposed that misleads VLLMs into generating descriptions unrelated to the image by perturbing the input to the CLIP image encoder.
   - Two jailbreak attack strategies were further evaluated, attacking the visual and language components respectively, to induce VLLMs to generate toxic content.
3. **Revealing the current security risks of VLLMs**:
   - The research found that VLLMs perform poorly when dealing with OOD text instructions, especially counterfactual questions.
   - Meanwhile, a simple visual encoder attack can effectively mislead VLLMs, but it is difficult to induce them to generate specific toxic content through visual input alone.
4. **Putting forward improvement suggestions**:
   - The paper emphasizes that safety protocols need to be introduced during the vision-language training process to ensure the safety of VLLMs in practical applications.

### Summary

This research aims to reveal the limitations of current VLLMs in OOD scenarios and under adversarial attacks by constructing a comprehensive security evaluation benchmark, and to provide directions for future research and improvement.
Specifically, the paper proposed new datasets and attack methods to evaluate the performance of VLLMs in different situations, and discovered their significant weaknesses in some tasks. These findings are helpful to promote the development of safer and more reliable VLLMs.
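
The encoder-only attack summarized above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the summary does not state the exact objective, so this example assumes a PGD-style attack that pushes the adversarial embedding away from the clean embedding (lower cosine similarity) under an L-infinity budget. A toy linear map stands in for the CLIP image encoder, and every name here (`make_toy_problem`, `encode`, `pgd_attack`, the `eps` and `step` values) is hypothetical.

```python
import math
import random


def make_toy_problem(seed=0, in_dim=16, out_dim=8):
    """Build a random linear 'encoder' and a random 'image' with pixels in [0, 1)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
    clean = [rng.random() for _ in range(in_dim)]
    return weights, clean


def encode(weights, image):
    """Toy stand-in for a vision encoder: embedding = weights @ image."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_grad_wrt_image(weights, image, clean_emb):
    """Analytic gradient of cosine(encode(image), clean_emb) w.r.t. the pixels."""
    emb = encode(weights, image)
    norm_a = math.sqrt(sum(x * x for x in emb))
    norm_b = math.sqrt(sum(y * y for y in clean_emb))
    cos = cosine(emb, clean_emb)
    # d cos / d emb = b / (|a||b|) - cos * a / |a|^2
    d_emb = [b / (norm_a * norm_b) - cos * a / norm_a ** 2 for a, b in zip(emb, clean_emb)]
    # Chain rule through the linear encoder: grad_image = weights^T @ d_emb
    return [sum(weights[i][j] * d_emb[i] for i in range(len(weights)))
            for j in range(len(image))]


def sign(x):
    return 0.0 if x == 0 else math.copysign(1.0, x)


def pgd_attack(weights, clean, eps=8 / 255, step=2 / 255, iters=20, seed=0):
    """Assumed objective: minimize cosine similarity to the clean embedding."""
    rng = random.Random(seed)
    clean_emb = encode(weights, clean)
    # Random start inside the eps-ball: the cosine gradient is zero at the clean image.
    adv = [min(1.0, max(0.0, p + rng.uniform(-eps, eps))) for p in clean]
    best, best_cos = adv, cosine(encode(weights, adv), clean_emb)
    for _ in range(iters):
        grad = cosine_grad_wrt_image(weights, adv, clean_emb)
        # Signed gradient descent step on the cosine similarity.
        adv = [p - step * sign(g) for p, g in zip(adv, grad)]
        # Project back into the eps-ball around the clean image, then into [0, 1].
        adv = [min(c + eps, max(c - eps, p)) for p, c in zip(adv, clean)]
        adv = [min(1.0, max(0.0, p)) for p in adv]
        cos = cosine(encode(weights, adv), clean_emb)
        if cos < best_cos:  # keep the most misleading iterate seen so far
            best, best_cos = adv, cos
    return best, best_cos


if __name__ == "__main__":
    weights, clean = make_toy_problem()
    adv, cos_adv = pgd_attack(weights, clean)
    print("cosine(clean, adversarial) =", cos_adv)
```

In a real setting the linear map would be replaced by the actual vision encoder and the analytic gradient by automatic differentiation. The key property the sketch shows is that the attack touches only the image branch: no access to the language model is needed to shift the visual embedding.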

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

VLSBench: Unveiling Visual Leakage in Multimodal Safety

Safety Alignment for Vision Language Models

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Intriguing Properties of Large Language and Vision Models

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

DARE: Diverse Visual Question Answering with Robustness Evaluation

The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning