Abstract: Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as the vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE's representation of visual information may not fully align with the LLM's cognitive framework, leading to a mismatch where visual features exceed the language model's interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data, i.e., images whose ambiguous visual representations challenge the VE's interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits an LVLM's capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the LLM's embedding space but also align with the LLM's cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight the essential role of cognitive alignment in advancing multimodal systems.

What problem does this paper attempt to address?

The problem this paper attempts to solve is **the cognitive misalignment between the vision encoder (VE) and the large language model (LLM) in large vision-language models (LVLMs)**. Although these models perform well on complex tasks, they still struggle with basic recognition tasks, such as accurately identifying landmarks in images.

### Specific manifestations of cognitive misalignment

1. **Visual features beyond the interpretive range of the language model**: The visual representations produced by the VE may not fully align with the LLM's cognitive framework, so some visual features fall outside what the LLM can interpret.
2. **Challenges of VE-Unknown data**: On VE-Unknown data (i.e., images whose ambiguous visual representations challenge the VE's interpretive precision), LVLM performance declines significantly.

### Solutions

To address these problems, the authors make two contributions:

1. **Constructing a multi-granularity landmark dataset (MGLD)**: By systematically evaluating how VE-Known and VE-Unknown data affect interpretive ability, the authors find that VE-Known data helps reduce cognitive misalignment, while VE-Unknown data limits the LVLM's capacity for accurate understanding.
2. **Proposing Entity-Enhanced Cognitive Alignment (EECA)**: This method uses multi-granularity supervision to generate visually enriched, well-aligned tokens. These tokens not only integrate into the LLM's embedding space but also align with the LLM's cognitive framework, which markedly improves LVLM performance in landmark recognition.

### Experimental results

The experimental results show that EECA outperforms the baseline model on both VE-Known and VE-Unknown data, with especially large gains on VE-Known data. The study also emphasizes data quality: high-quality VE-Known data improves LVLM performance more than a large amount of low-quality data.

### Formula representation

The key formulas in the paper are:

1. **CLIP similarity calculation**:

   \[ \text{Sim}_{\text{CLIP}}(I_i, T_j)=\frac{\langle f_v(I_i), f_t(T_j)\rangle}{\|f_v(I_i)\|\,\|f_t(T_j)\|} \]

   where \( f_v(I_i) \) is the visual embedding of image \( I_i \) and \( f_t(T_j) \) is the text embedding of landmark name \( T_j \).

2. **Entity-aware contrastive loss**:

   \[ L_e =-\frac{1}{2B}\sum_{i = 1}^{B}\sum_{j = 1}^{E_i}\left[\log\frac{\exp(S(X_{ei,j},\tilde{X}_{ei,j})/\tau)}{\sum_{k = 1}^{E_i}\exp(S(X_{ei,j},\tilde{X}_{ei,k})/\tau)}+\log\frac{\exp(S(\tilde{X}_{ei,j},X_{ei,j})/\tau)}{\sum_{k = 1}^{E_i}\exp(S(\tilde{X}_{ei,j},X_{ei,k})/\tau)}\right] \]

   where \( B \) is the batch size, \( E_i \) is the number of entities in the \( i \)-th image, \( S(a, b)=\frac{a\cdot b}{\|a\|\,\|b\|} \) is cosine similarity, and \( \tau \) is the temperature parameter. The two log terms contrast each entity representation \( X_{ei,j} \) with its paired counterpart \( \tilde{X}_{ei,j} \) in both directions, using the other entities of the same image as negatives.

3. **Hierarchical classification loss**: High-resolution and low-resolution visual tokens are concatenated and then average-pooled to obtain a comprehensive representation \( h_i \), which is used to compute the hierarchical classification loss.

These formulas show how the authors improve the cognitive alignment of LVLMs: they quantify the similarity between visual and text embeddings, then add a contrastive loss and a hierarchical classification loss.
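
The first two formulas can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the nested-list input layout (`batch_x[i][j]` is the vector for entity \( j \) of image \( i \)), and the default temperature of 0.07 are all hypothetical choices made for the example. In practice the vectors would come from the model rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """S(a, b) = a.b / (|a| |b|); also the form of Sim_CLIP for two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def log_softmax_at(logits, idx):
    """log( exp(logits[idx]) / sum_k exp(logits[k]) ), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[idx] - log_sum_exp


def entity_contrastive_loss(batch_x, batch_x_tilde, tau=0.07):
    """Symmetric contrastive loss L_e over paired entity vectors.

    batch_x[i][j] and batch_x_tilde[i][j] are the paired vectors for
    entity j of image i (layout assumed for this sketch). Negatives are
    the other entities of the same image, as in the sum over k.
    """
    batch_size = len(batch_x)
    total = 0.0
    for x_i, xt_i in zip(batch_x, batch_x_tilde):
        num_entities = len(x_i)
        for j in range(num_entities):
            # Direction X -> X~: entity j against every X~ of this image.
            fwd = [cosine_similarity(x_i[j], xt_i[k]) / tau
                   for k in range(num_entities)]
            # Direction X~ -> X: paired vector j against every X of this image.
            bwd = [cosine_similarity(xt_i[j], x_i[k]) / tau
                   for k in range(num_entities)]
            total += log_softmax_at(fwd, j) + log_softmax_at(bwd, j)
    return -total / (2 * batch_size)


if __name__ == "__main__":
    # One image with two entities; pairs are either aligned or swapped.
    x = [[[1.0, 0.0], [0.0, 1.0]]]
    aligned = [[[1.0, 0.0], [0.0, 1.0]]]
    swapped = [[[0.0, 1.0], [1.0, 0.0]]]
    print("aligned:", entity_contrastive_loss(x, aligned, tau=1.0))
    print("swapped:", entity_contrastive_loss(x, swapped, tau=1.0))
```

The loss is lower when each entity vector is most similar to its own paired vector and higher when the pairs are mismatched, which is the behavior the contrastive term is meant to encourage.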

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Enhancing Advanced Visual Reasoning Ability of Large Language Models

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

Unified Lexical Representation for Interpretable Visual-Language Alignment

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Visually-Augmented Language Modeling

Large Vision-Language Models as Emotion Recognizers in Context Awareness

InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding

Visual In-Context Learning for Large Vision-Language Models

Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models

From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

A-VL: Adaptive Attention for Large Vision-Language Models

LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

EVLM: An Efficient Vision-Language Model for Visual Understanding

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders