Yaqi Zhao, Yuanyang Yin, Lin Li, Mingan Lin, Victor Shea-Jay Huang, Siwei Chen, Weipeng Chen, Baoqun Yin, Zenan Zhou, Wentao Zhang
Abstract: Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as the vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE's representation of visual information may not fully align with the LLM's cognitive framework, leading to a mismatch where visual features exceed the language model's interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data, i.e., images whose ambiguous visual representations challenge the VE's interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits the LVLM's capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the LLM's embedding space but also align with the LLM's cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight the essential role of cognitive alignment in advancing multimodal systems.
What problem does this paper attempt to address?
The paper addresses **the cognitive misalignment between the vision encoder (VE) and the large language model (LLM) in Large Vision-Language Models (LVLMs)**. Although these models perform well on complex tasks, they still struggle with basic recognition tasks, such as accurately identifying landmarks in images.
### Specific manifestations of cognitive misalignment
1. **Visual features beyond the LLM's interpretive range**: The VE's representation of visual information may not fully align with the LLM's cognitive framework, so some visual features fall outside what the LLM can interpret.
2. **Challenges of VE-Unknown data**: When the LLM faces VE-Unknown data (images whose ambiguous visual representations challenge the VE's interpretive precision), LVLM performance declines significantly.
### Solutions
To address these problems, the authors make two contributions:
1. **Constructing a multi-granularity landmark dataset (MGLD)**: Using this dataset, the authors systematically evaluate how VE-Known and VE-Unknown data affect the LVLM's interpretive ability. They find that VE-Known data, rich in distinctive features, helps reduce cognitive misalignment, while VE-Unknown data limits the LVLM's capacity for accurate understanding.
2. **Proposing the Entity-Enhanced Cognitive Alignment (EECA) method**: EECA uses multi-granularity supervision to generate visually enriched, well-aligned tokens. These tokens not only integrate into the LLM's embedding space but also align with the LLM's cognitive framework, which markedly improves LVLM performance in landmark recognition.
### Experimental results
The experimental results show that EECA outperforms the baseline model on both VE-Known and VE-Unknown data, with the larger improvement on VE-Known data. The study also emphasizes data quality: high-quality VE-Known data improves LVLM performance more than a large amount of low-quality data.
### Formula representation
The main formulas in the paper are as follows:
1. **CLIP similarity calculation**:
\[
\text{Sim}_{\text{CLIP}}(I_i, T_j)=\frac{\langle f_v(I_i), f_t(T_j)\rangle}{\|f_v(I_i)\|\|f_t(T_j)\|}
\]
where \( f_v(I_i) \) and \( f_t(T_j) \) represent the visual and text embeddings of the image and the landmark name, respectively.
2. **Entity-aware contrastive loss**:
\[
L_e =-\frac{1}{2B}\sum_{i = 1}^{B}\sum_{j = 1}^{E_i}\left[\log\frac{\exp(S(X_{ei,j},\tilde{X}_{ei,j})/\tau)}{\sum_{k = 1}^{E_i}\exp(S(X_{ei,j},\tilde{X}_{ei,k})/\tau)}+\log\frac{\exp(S(\tilde{X}_{ei,j},X_{ei,j})/\tau)}{\sum_{k = 1}^{E_i}\exp(S(\tilde{X}_{ei,j},X_{ei,k})/\tau)}\right]
\]
where \( B \) is the batch size, \( E_i \) is the number of entities in the \( i \)-th image, \( S(a, b)=\frac{a\cdot b}{\|a\|\|b\|} \), and \( \tau \) is the temperature parameter.
3. **Hierarchical classification loss**:
High-resolution and low-resolution visual tokens are concatenated and then average-pooled to obtain a comprehensive representation \( h_i \), which is used to compute the hierarchical classification loss.
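The CLIP similarity in formula 1 is a plain cosine similarity between two embedding vectors. A minimal sketch in standard-library Python follows, assuming the embeddings are already available as lists of floats. The function name `cosine_similarity` and the toy vectors are illustrative stand-ins for \( f_v(I_i) \) and \( f_t(T_j) \), not values or code from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return <u, v> / (||u|| * ||v||), the form of Sim_CLIP in formula 1."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumptions for illustration only).
image_embedding = [1.0, 2.0, 2.0]   # stands in for f_v(I_i)
matching_text = [2.0, 4.0, 4.0]     # same direction as the image embedding
unrelated_text = [2.0, -1.0, 0.0]   # orthogonal to the image embedding

sim_match = cosine_similarity(image_embedding, matching_text)
sim_unrelated = cosine_similarity(image_embedding, unrelated_text)
```

Because the score depends only on direction, scaling either embedding leaves it unchanged, which is why `matching_text` scores the same as the image embedding itself would.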
These formulas show how the authors improve the cognitive alignment of LVLMs: by quantifying the similarity between visual and text embeddings, and by introducing a contrastive loss and a hierarchical classification loss.
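The entity-aware contrastive loss in formula 2 can be read as a symmetric, per-image InfoNCE-style loss: for each entity \( j \) of image \( i \), the matching pair is scored against the other entities of the same image, in both directions. The sketch below is a direct transcription of that formula in standard-library Python, not the authors' implementation. It assumes \( X \) and \( \tilde{X} \) are paired embeddings of the same entities (the summary does not define them further); the names `entity_contrastive_loss`, `x`, `x_tilde`, and the value of `tau` are illustrative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """S(a, b) = a.b / (||a|| ||b||), as defined under formula 2."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def log_softmax_at(scores: List[float], index: int) -> float:
    """log(exp(scores[index]) / sum_k exp(scores[k])), computed stably."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[index] - log_z


def entity_contrastive_loss(
    x: List[List[Vector]], x_tilde: List[List[Vector]], tau: float
) -> float:
    """x[i][j] and x_tilde[i][j] hold the two embeddings of entity j in image i."""
    batch_size = len(x)
    total = 0.0
    for ents, ents_tilde in zip(x, x_tilde):
        n = len(ents)  # E_i, the number of entities in this image
        for j in range(n):
            # First term: X_j against every X~_k of the same image.
            fwd = [cosine(ents[j], ents_tilde[k]) / tau for k in range(n)]
            # Second term: X~_j against every X_k of the same image.
            bwd = [cosine(ents_tilde[j], ents[k]) / tau for k in range(n)]
            total += log_softmax_at(fwd, j) + log_softmax_at(bwd, j)
    return -total / (2 * batch_size)


# Hypothetical toy batch: one image with two entities (assumed values).
x = [[[1.0, 0.0], [0.0, 1.0]]]
aligned = [[[1.0, 0.0], [0.0, 1.0]]]   # each entity paired with its match
swapped = [[[0.0, 1.0], [1.0, 0.0]]]   # pairs deliberately mismatched

aligned_loss = entity_contrastive_loss(x, aligned, tau=0.5)
swapped_loss = entity_contrastive_loss(x, swapped, tau=0.5)
```

In this toy batch the aligned pairing yields a smaller loss than the swapped one, which is the behavior the loss is meant to reward. Note that the negatives here come only from the same image, as the inner sum over \( k = 1, \dots, E_i \) in formula 2 indicates.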