Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies in the knowledge represented by their vision and language components. In this paper, we formally define the problem of $\textbf{cross-modality parametric knowledge conflict}$ and present a systematic approach to detect, interpret, and mitigate such conflicts. We introduce a pipeline that identifies conflicts between visual and textual answers, showing a persistently high conflict rate across modalities in recent LVLMs regardless of model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric to distinguish conflicting samples from the others. Building on these insights, we develop a novel dynamic contrastive decoding method that removes undesirable logits inferred from the less confident modality components based on answer confidence. For models that do not provide logits, we also introduce two prompt-based strategies to mitigate the conflicts. Our methods achieve promising improvements in accuracy on both the ViQuAE and InfoSeek datasets. Specifically, using LLaVA-34B, our proposed dynamic contrastive decoding improves accuracy by an average of 2.24%.

What problem does this paper attempt to address?

This paper addresses cross-modal parametric knowledge conflict in Large Vision-Language Models (LVLMs). Specifically, it focuses on four things:

1. **Defining and studying cross-modal parametric knowledge conflict**: The paper is the first to systematically define and study cross-modal parametric knowledge conflict in LVLMs. These conflicts arise because the visual and language components are trained on different datasets with different training objectives, which leads to inconsistent knowledge representations.
2. **Detecting cross-modal parametric knowledge conflict**: The paper proposes a detection pipeline based on a multiple-choice question-answering format. A conflict is flagged when the model's answer about an entity given visually differs from its answer about the same entity given as text. The conflict rate remains high across models of different sizes and architectures.
3. **Interpreting the conflict and its impact on inference**: The paper examines how these conflicts affect the model's inference process and proposes a contrastive metric to identify conflicting samples. It finds that conflicts do not necessarily lower prediction confidence; instead, they may produce answers that are more confident but wrong.
4. **Mitigating cross-modal parametric knowledge conflict**: Building on the strong discriminative ability of the contrastive metric, the paper proposes a dynamic contrastive decoding method that selectively removes unwanted logits from the less reliable modality according to answer confidence. It also proposes two prompt-based strategies for models that do not expose logits.

### Main contributions

1. **Define and study cross-modal parametric knowledge conflict in LVLMs for the first time**.
2. **Propose a practical pipeline for detecting conflicts and a contrastive metric for distinguishing conflicting samples**.
3. **Introduce a dynamic contrastive decoding method and two prompt-based strategies to mitigate conflicts**.

### Formula summary

- **Contrastive metric**:

\[
\log(p_{cd}) = \log(p_v) - \log(p_t) = \log\left(\frac{p_{VLM}(y_v \mid x_v, q)}{p_{VLM}(y_t \mid x_e, q)}\right) = \log\left(\frac{p_{LM}(y_v \mid F(V(x_v)), \text{embed}(q))}{p_{LM}(y_t \mid \text{embed}(x_e), \text{embed}(q))}\right)
\]

  where $p_v$ and $p_t$ are the probabilities of the visual answer $y_v$ and the textual answer $y_t$ respectively, $F(V(x_v))$ is the projection of the visual encoder output, and $\text{embed}(x_e)$ is the embedding of the text description.

- **Conflict degree**:

\[
|\log(p_{cd})|
\]

Together, these formulas and methods give new perspectives and tools for understanding and mitigating cross-modal parametric knowledge conflict in LVLMs.
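To make the two formulas and the detection step concrete, here is a minimal Python sketch. It assumes the answer probabilities $p_v$ and $p_t$ have already been read out of the model, and it treats a conflict as the visual and textual answers to the same question being different options. The function names (`contrastive_metric`, `conflict_degree`, `conflict_rate`) and the sample values are hypothetical illustrations, not the paper's code.

```python
import math
from typing import Sequence


def contrastive_metric(p_v: float, p_t: float) -> float:
    """log(p_cd) = log(p_v) - log(p_t).

    p_v: probability the model assigns to its visual answer.
    p_t: probability the model assigns to its textual answer.
    Both are assumed to lie in (0, 1].
    """
    if not (0.0 < p_v <= 1.0 and 0.0 < p_t <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_v) - math.log(p_t)


def conflict_degree(p_v: float, p_t: float) -> float:
    """|log(p_cd)|: magnitude of the confidence gap between modalities."""
    return abs(contrastive_metric(p_v, p_t))


def conflict_rate(visual_answers: Sequence[str],
                  textual_answers: Sequence[str]) -> float:
    """Fraction of questions where the two modalities pick different options.

    Assumption: answers are multiple-choice labels such as "A".."D",
    aligned by question index.
    """
    if len(visual_answers) != len(textual_answers):
        raise ValueError("answer lists must have the same length")
    if not visual_answers:
        return 0.0
    mismatches = sum(v != t for v, t in zip(visual_answers, textual_answers))
    return mismatches / len(visual_answers)


if __name__ == "__main__":
    # Hypothetical probabilities for one sample.
    print(contrastive_metric(0.8, 0.2))   # positive: visual answer is more confident
    print(conflict_degree(0.2, 0.8))      # same magnitude, sign removed
    # Hypothetical multiple-choice answers for four questions.
    print(conflict_rate(["A", "B", "C", "D"], ["A", "C", "C", "A"]))
```

The sign of `contrastive_metric` indicates which modality is more confident, while `conflict_degree` discards the sign and keeps only the size of the gap. How that gap is then used to adjust logits during decoding is not specified in this summary, so the sketch stops at the metric.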

Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models

Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

Cross-Modal Consistency in Multimodal Large Language Models

Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Debiasing Multimodal Large Language Models

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

X-VILA: Cross-Modality Alignment for Large Language Model

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Resolving Knowledge Conflicts in Large Language Models

LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective

Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA

Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models

Unified Lexical Representation for Interpretable Visual-Language Alignment

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding