How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie
DOI: https://doi.org/10.48550/arXiv.2311.16101
2023-11-28
Abstract: This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel VQA datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy for misleading VLLMs to produce visual-unrelated responses. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or language component of VLLMs. Our evaluation of 21 diverse models, ranging from open-source VLLMs to GPT-4V, yields interesting observations: 1) Current VLLMs struggle with OOD texts but not images, unless the visual information is limited; and 2) These VLLMs can be easily misled by deceiving vision encoders only, and their vision-language training often compromises safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vllm-safety-benchmark.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper sets out to evaluate, and ultimately help improve, the safety and robustness of Vision Large Language Models (Vision LLMs, VLLMs) in two settings: inputs that fall outside the training data distribution (out-of-distribution, OOD) and adversarial attacks. Specifically:

1. **Evaluating performance in OOD scenarios**:
   - The authors designed two new VQA (Visual Question Answering) datasets, OODCV-VQA and Sketchy-VQA, and their variants, to test how VLLMs perform on uncommon images and sketches.
   - OODCV-VQA contains images with uncommon texture, weather, pose, and other conditions, while Sketchy-VQA focuses on images in sketch form.
2. **Evaluating robustness against adversarial attacks**:
   - The authors propose a simple attack strategy that perturbs the input to CLIP's image encoder so that VLLMs generate descriptions unrelated to the image.
   - They further evaluate two jailbreak strategies, one targeting the vision component and one targeting the language component, that try to induce VLLMs to generate toxic content.
3. **Revealing the current safety risks of VLLMs**:
   - VLLMs perform poorly on OOD text instructions, especially counterfactual questions.
   - A simple attack on the vision encoder can effectively mislead VLLMs, but inducing them to generate specific toxic content through visual input alone is difficult.
4. **Putting forward improvement suggestions**:
   - The paper emphasizes that safety protocols need to be introduced during vision-language training to keep VLLMs safe in practical applications.

### Summary

This research aims to reveal the limitations of current VLLMs in OOD scenarios and under adversarial attacks by constructing a comprehensive safety evaluation benchmark, and to provide directions for future research and improvement.
Specifically, the paper proposes new datasets and attack methods to evaluate VLLMs in different situations, and finds significant weaknesses on some tasks. These findings can help promote the development of safer and more reliable VLLMs.
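
To make the idea of "misleading a VLLM by deceiving the vision encoder only" more concrete, the following is a minimal sketch of a projected-gradient-style attack against a toy encoder. It is not the authors' implementation: the linear `encode` function stands in for CLIP's image encoder, the objective (maximizing the squared distance between clean and perturbed embeddings) is an assumed one, and the names `pgd_attack` and `embedding_shift` and the values of `eps`, `step`, and `iters` are hypothetical choices for illustration.

```python
import random


def encode(W, x):
    """Toy linear stand-in for a vision encoder: returns W @ x as a list."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def embedding_shift(W, x_clean, x_adv):
    """Squared L2 distance between the clean and perturbed embeddings."""
    e_clean, e_adv = encode(W, x_clean), encode(W, x_adv)
    return sum((a - b) ** 2 for a, b in zip(e_clean, e_adv))


def pgd_attack(W, x, eps=0.03, step=0.01, iters=20, seed=0):
    """Search for a perturbation with L-inf norm <= eps that moves the embedding.

    Assumed objective: maximize ||W(x + delta) - W x||^2 = ||W delta||^2.
    """
    rng = random.Random(seed)
    # Random start inside the eps-ball: the gradient is exactly zero at delta = 0.
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(iters):
        # For this linear encoder the gradient w.r.t. delta is 2 * W^T (W delta).
        w_delta = encode(W, delta)
        grad = [
            2 * sum(W[i][j] * w_delta[i] for i in range(len(W)))
            for j in range(len(x))
        ]
        # Signed ascent step, then projection back onto the L-inf ball.
        delta = [
            max(-eps, min(eps, d + step * ((g > 0) - (g < 0))))
            for d, g in zip(delta, grad)
        ]
    # Keep the perturbed "image" in the valid pixel range [0, 1].
    return [max(0.0, min(1.0, xi + d)) for xi, d in zip(x, delta)]


# Hypothetical toy data: a 4-dimensional embedding of an 8-"pixel" image.
rng = random.Random(1)
W = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
x = [rng.uniform(0.2, 0.8) for _ in range(8)]
x_adv = pgd_attack(W, x)
print("embedding shift:", embedding_shift(W, x, x_adv))
```

In a real setting the toy encoder would be replaced by an actual vision encoder and the gradient would come from automatic differentiation. The point of the sketch is only that a small, bounded change to the pixels can move the embedding that the language model later conditions on.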