Intriguing Properties of Large Language and Vision Models

Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi
2024-10-07
Abstract: Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate this question regarding several aspects: permutation invariance, robustness, math reasoning, alignment preserving, and importance, by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper attempts to address the issue of large language and vision models (LLVMs) performing poorly on basic perception tasks, despite their excellent performance on high-level reasoning tasks. Specifically, the paper focuses on the following aspects:
1. **Permutation Invariance**: Investigating whether LLVMs can process images in a global manner, even when the sequence of image patches is randomly shuffled.
2. **Robustness**: Exploring the performance of LLVMs when faced with challenges such as occlusion.
3. **Mathematical Reasoning Ability**: Analyzing the performance of LLVMs when handling images containing detailed numerical information, especially in solving mathematical problems.
4. **Cross-Modal Alignment**: Evaluating whether LLVMs retain the perceptual capabilities of their original visual encoders after alignment and visual instruction fine-tuning.
5. **Importance**: Studying the mechanism of visual information processing in different layers of LLVMs, especially the lower layers.
Through these studies, the paper aims to reveal some interesting characteristics of current LLVMs and provide potential directions for building better LLVMs in the future.
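
The permutation invariance aspect can be illustrated with a toy example. The following is only a minimal sketch, not the paper's evaluation code: it assumes patch tokens are plain Python lists of integers, and the helper names `permute_patches` and `mean_pool` are hypothetical. Mean pooling stands in here as one simple example of an order-independent ("global") summary; how actual LLVMs aggregate visual tokens is not specified by the text above.

```python
import random


def permute_patches(patch_tokens, seed=0):
    """Return a randomly permuted copy of a patch sequence and the order used.

    Hypothetical helper: the seed makes the shuffle reproducible.
    """
    rng = random.Random(seed)
    order = list(range(len(patch_tokens)))
    rng.shuffle(order)
    return [patch_tokens[i] for i in order], order


def mean_pool(patch_tokens):
    """Per-dimension mean over all patches (an order-independent summary)."""
    n = len(patch_tokens)
    dim = len(patch_tokens[0])
    return [sum(tok[d] for tok in patch_tokens) / n for d in range(dim)]


# Toy input (assumption): a 2x2 grid of patches, flattened row by row,
# each patch represented by a 3-dimensional integer "embedding".
patches = [[1, 0, 2], [4, 1, 0], [0, 3, 5], [7, 2, 1]]

shuffled, order = permute_patches(patches, seed=0)

# The shuffled sequence holds the same patches, possibly in another order,
# so a summary that ignores position is unchanged.
same_summary = mean_pool(shuffled) == mean_pool(patches)
```

A model whose output depended only on such order-independent summaries would be unaffected by patch shuffling, whereas a model relying on patch position would not. Comparing benchmark scores on original and shuffled patch sequences is one way to probe where a given model falls between these two extremes.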