On the Robustness of Multimodal Large Language Models

AI Meta, Young Kyun Jang, Alejandro Aparcedo, Ser-Nam Lim
Abstract: Visual Large Language Models (VLLMs) have shown promising capabilities in understanding visual context. In this study, we investigate the performance of a VLLM, LLaVA, on a visual question answering task after augmenting the input image with noise, rotation, cropping, and similar transformations. We further probe the resilience of VLLMs under adversarial conditions, specifically when the vision encoder is subjected to adversarial attacks. Our findings reveal that our VLLM’s ability to understand visual context is minimally impacted by augmenting the input image. We also discover that our VLLM exhibits reduced susceptibility to adversarial attacks. This crucial insight suggests that the integration of a Large Language Model (LLM) as a language decoder, coupled with a vision encoder, could potentially serve as a countermeasure against adversarial attacks.
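As a rough illustration of the kind of input augmentation the abstract mentions (noise, rotation, crop), the following is a minimal sketch in plain Python on a toy grayscale image. The function names, the noise level `sigma`, the fixed 90-degree rotation, and the crop size are all assumptions made for illustration; the abstract does not state the authors' actual settings or pipeline.

```python
import random


def add_gaussian_noise(image, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise to each pixel, clamped to [0, 255].

    sigma and seed are illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def rotate90(image):
    """Rotate the image 90 degrees clockwise (a stand-in for 'a rotation')."""
    return [list(row) for row in zip(*image[::-1])]


def center_crop(image, height, width):
    """Keep the central height x width window of the image."""
    top = (len(image) - height) // 2
    left = (len(image[0]) - width) // 2
    return [row[left:left + width] for row in image[top:top + height]]


# Toy 4x4 grayscale "image" with pixel values 0..15 (hypothetical data).
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# One augmented copy per transformation; each would be paired with the
# original question and passed to the model under evaluation.
augmented = {
    "noise": add_gaussian_noise(image),
    "rotation": rotate90(image),
    "crop": center_crop(image, 2, 2),
}
```

In a robustness study of this kind, each augmented image would typically be fed to the model with the unchanged question, and the answers compared against those obtained on the original image.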
Computer Science