Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models

Tinghui Zhu, Qin Liu, Fei Wang, Zhengzhong Tu, Muhao Chen
2024-10-11
Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies in the knowledge represented by their vision and language components. In this paper, we formally define the problem of **cross-modality parametric knowledge conflict** and present a systematic approach to detect, interpret, and mitigate such conflicts. We introduce a pipeline that identifies conflicts between visual and textual answers, showing a persistently high conflict rate across modalities in recent LVLMs regardless of model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric to distinguish conflicting samples from the others. Building on these insights, we develop a novel dynamic contrastive decoding method that removes undesirable logits inferred from the less confident modality components based on answer confidence. For models that do not provide logits, we also introduce two prompt-based strategies to mitigate the conflicts. Our methods achieve promising improvements in accuracy on both the ViQuAE and InfoSeek datasets. Specifically, using LLaVA-34B, our proposed dynamic contrastive decoding improves average accuracy by 2.24%.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses cross-modal parametric knowledge conflict in large vision-language models (LVLMs). Specifically, it focuses on:

1. **Defining and studying cross-modal parametric knowledge conflict**
   - For the first time, the paper systematically defines and studies cross-modal parametric knowledge conflict in LVLMs. These conflicts arise because the visual and language components are trained on different datasets with different training objectives, which leads to inconsistent knowledge representations.
2. **Detecting cross-modal parametric knowledge conflict**
   - The paper proposes a detection pipeline based on the multiple-choice question-answering format. A conflict is identified by comparing the answers that the visual and textual modalities give about the same entity. The results show that the conflict rate remains high across models of different scales and architectures.
3. **Interpreting cross-modal parametric knowledge conflict and its impact on inference**
   - The paper examines how these conflicts affect the model's inference process and proposes a contrastive metric to identify conflicting samples. It finds that conflicts do not necessarily lower prediction confidence; instead, they may produce answers that are more confident but wrong.
4. **Mitigating cross-modal parametric knowledge conflict**
   - Building on the strong discriminative power of the contrastive metric, the paper proposes a dynamic contrastive decoding method that selectively removes undesirable logits from the less confident modality, based on answer confidence. It also proposes two prompt-based strategies for models that do not expose logits.

### Main contributions

1. **For the first time, define and study cross-modal parametric knowledge conflict in LVLMs**.
2. **Propose a practical pipeline for detecting conflicts and a contrastive metric for distinguishing conflicting samples**.
3. **Introduce a dynamic contrastive decoding method and two prompt-based strategies to mitigate conflicts**.

### Formula summary

- **Contrastive metric**:

\[
\log(p_{cd}) = \log(p_v) - \log(p_t) = \log\left(\frac{p_{VLM}(y_v \mid x_v, q)}{p_{VLM}(y_t \mid x_e, q)}\right) = \log\left(\frac{p_{LM}(y_v \mid F(V(x_v)), \text{embed}(q))}{p_{LM}(y_t \mid \text{embed}(x_e), \text{embed}(q))}\right)
\]

  where \( p_v \) and \( p_t \) are the probabilities of the visual answer \( y_v \) and the textual answer \( y_t \) respectively, \( q \) is the question, \( F(V(x_v)) \) is the projection of the visual encoder output, and \( \text{embed}(x_e) \) is the embedding of the text description.

- **Conflict degree**:

\[
|\log(p_{cd})|
\]

With these formulas and methods, the paper offers new perspectives and tools for understanding and mitigating cross-modal parametric knowledge conflict in LVLMs.
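
The two formulas above, together with the answer-comparison step of the detection pipeline, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Sample` class, the function names, and the example probabilities are hypothetical, and `conflict_rate` simply reports the fraction of samples whose visual and textual answers differ, which may be simpler than the rate reported in the paper. The probabilities \( p_v \) and \( p_t \) are assumed to have been read off the model beforehand.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One multiple-choice question answered via both modalities (hypothetical container)."""
    visual_answer: str    # y_v: answer given the image x_v
    textual_answer: str   # y_t: answer given the entity text x_e
    p_visual: float       # p_v: probability the model assigns to y_v
    p_textual: float      # p_t: probability the model assigns to y_t


def contrastive_metric(p_v: float, p_t: float) -> float:
    """log(p_cd) = log(p_v) - log(p_t)."""
    if p_v <= 0.0 or p_t <= 0.0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_v) - math.log(p_t)


def conflict_degree(p_v: float, p_t: float) -> float:
    """|log(p_cd)|: magnitude of the confidence gap between the modalities."""
    return abs(contrastive_metric(p_v, p_t))


def is_conflict(sample: Sample) -> bool:
    """Flag a sample when the visual and textual answers disagree."""
    return sample.visual_answer != sample.textual_answer


def conflict_rate(samples: list[Sample]) -> float:
    """Fraction of samples whose answers disagree (a simplifying assumption)."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(is_conflict(s) for s in samples) / len(samples)


if __name__ == "__main__":
    # Made-up values for illustration only.
    samples = [
        Sample("A", "A", p_visual=0.70, p_textual=0.65),
        Sample("B", "C", p_visual=0.80, p_textual=0.20),
    ]
    for s in samples:
        print(is_conflict(s), round(conflict_degree(s.p_visual, s.p_textual), 3))
    print("conflict rate:", conflict_rate(samples))
```

A positive `contrastive_metric` means the visual answer carries more confidence than the textual one, and a negative value means the opposite; the conflict degree discards the sign and keeps only the size of the gap.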