Intriguing Properties of Large Language and Vision Models

Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi

2024-10-07

Abstract: Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate this question along several aspects: permutation invariance, robustness, math reasoning, alignment preserving, and importance, by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

The paper attempts to address the issue of large language and vision models (LLVMs) performing poorly on basic perception tasks, despite their excellent performance on high-level reasoning tasks. Specifically, the paper focuses on the following aspects:

1. **Permutation Invariance**: Investigating whether LLVMs can process images in a global manner, even when the sequence of image patches is randomly shuffled.
2. **Robustness**: Exploring the performance of LLVMs when faced with challenges such as occlusion.
3. **Mathematical Reasoning Ability**: Analyzing the performance of LLVMs when handling images containing detailed numerical information, especially in solving mathematical problems.
4. **Cross-Modal Alignment**: Evaluating whether LLVMs retain the perceptual capabilities of their original visual encoders after alignment and visual instruction fine-tuning.
5. **Importance**: Studying the mechanism of visual information processing in different layers of LLVMs, especially the lower layers.

Through these studies, the paper aims to reveal some interesting characteristics of current LLVMs and provide potential directions for building better LLVMs in the future.
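The permutation-invariance probe mentioned above can be illustrated with a small sketch. This is not the authors' code: it assumes the visual patch tokens are available as a plain Python sequence, and the names `permute_patch_tokens` and `relative_drop` are hypothetical helpers introduced only for illustration. The idea is to shuffle the order of the patch tokens before they reach the language model and then compare a benchmark score with and without the shuffle.

```python
import random


def permute_patch_tokens(patch_tokens, seed=0):
    """Return a shuffled copy of the patch tokens and the permutation used.

    patch_tokens: any sequence, one entry per visual patch (hypothetical input).
    seed: fixes the shuffle so the probe is reproducible.
    """
    rng = random.Random(seed)
    order = list(range(len(patch_tokens)))
    rng.shuffle(order)
    # order[k] is the original index of the token placed at position k.
    shuffled = [patch_tokens[i] for i in order]
    return shuffled, order


def relative_drop(original_score, permuted_score):
    """Fraction of the original score lost after permuting the patches."""
    return (original_score - permuted_score) / original_score


# Illustrative usage with placeholder tokens (not real model features).
tokens = [f"patch_{i}" for i in range(16)]
shuffled, order = permute_patch_tokens(tokens, seed=42)
```

A small relative drop under this kind of shuffle would be consistent with the model treating the image globally rather than relying on patch order; the scores themselves would have to come from an actual benchmark run.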

MoAI: Mixture of All Intelligence for Large Language and Vision Models

Phantom of Latent for Large Language and Vision Models

Enhancing Advanced Visual Reasoning Ability of Large Language Models

Effectiveness Assessment of Recent Large Vision-Language Models

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

TroL: Traversal of Layers for Large Language and Vision Models

Do better language models have crisper vision?

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

On the Robustness of Multimodal Large Language Models

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Can Large Language Models Understand Symbolic Graphics Programs?

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification