Abstract: Recently, large language and vision models (LLVMs) have received significant attention and development effort due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate several aspects: permutation invariance, robustness, math reasoning, alignment preservation, and importance, by evaluating the most common LLVM families (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on these observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.

Intriguing Properties of Large Language and Vision Models
