Abstract: Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at https://lacing-lvlm.github.io.

What problem does this paper attempt to address?

This paper addresses language bias in large vision-language models (LVLMs). Although LVLMs achieve impressive results on a range of vision-language tasks, language bias often leads them to hallucinate: they pay less attention to the image and understand its content poorly. The paper attributes this bias to two main causes:

1. **Data scale gap between the pre-training stage and the multimodal alignment stage**: The LLM component of an LVLM is pre-trained on large-scale text corpora, whereas the multimodal alignment stage uses far fewer training samples and a much shorter training time. Because of this gap, the pre-training distribution dominates, and the LVLM fails to make full use of visual inputs.
2. **Inference bias from the short-term dependency of text data**: In text, a word correlates strongly with its neighbors and only weakly with distant words. LLMs pre-trained on large text corpora pick up this short-term dependency and assign higher attention weights to adjacent words. That pattern can be inappropriate for multimodal inputs, where it leads LVLMs to over-rely on text inputs and neglect the actual visual inputs.

To address these problems, the paper proposes the LACING framework, which has two core mechanisms:

1. **Multimodal Dual-Attention Mechanism (MDA)**: A parallel dual-attention mechanism computes attention weights for visual and text inputs separately, then fuses the two sets of weights into the final attention map. This lets the LVLM attend fully to visual inputs in every layer while retaining causal attention over text inputs.
2. **Soft-Image Guidance (SIG; abbreviated IFG in the abstract)**: A learnable soft visual prompt replaces the visual input to construct a multimodal-null input. The soft visual prompt acts as a placeholder that keeps the input format consistent while forcing the model to prioritize text inputs during inference. SIG also introduces a new decoding strategy that uses the soft visual prompt to reduce the model's over-reliance on adjacent text inputs.

Together, these mechanisms reduce the language bias of LVLMs, improve visual understanding, and reduce hallucinations without requiring additional training resources or data. The reported experiments show that the method significantly improves LVLM performance on multiple benchmarks.
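The two mechanisms described above can be illustrated with a small NumPy sketch. It is not the paper's implementation. `dual_attention` assumes that only text tokens act as queries and that the two attention maps are fused with a fixed scalar `visual_weight`; both the fusion rule and the parameter name are hypothetical. `guided_logits` assumes a generic contrastive, classifier-free-guidance-style combination of the logits obtained with the real image and with the soft visual prompt; the summary does not state the actual decoding formula, so `gamma` and the formula itself are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries set to -inf receive zero weight."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dual_attention(q_txt, k_vis, v_vis, k_txt, v_txt, visual_weight=0.5):
    """Toy parallel dual attention for text queries (hypothetical fusion rule).

    Visual branch: every text query attends to all visual keys, unmasked.
    Text branch:   causal attention over text keys only.
    The two maps are normalized separately, then fused with a fixed weight.
    """
    n_txt, d = q_txt.shape
    # Visual branch: separate softmax over visual keys.
    vis_attn = softmax(q_txt @ k_vis.T / np.sqrt(d))
    # Text branch: lower-triangular (causal) mask, separate softmax.
    txt_scores = q_txt @ k_txt.T / np.sqrt(d)
    causal = np.tril(np.ones((n_txt, n_txt), dtype=bool))
    txt_attn = softmax(np.where(causal, txt_scores, -np.inf))
    # Fuse: each row still sums to 1, with a guaranteed share for vision.
    attn = np.concatenate(
        [visual_weight * vis_attn, (1.0 - visual_weight) * txt_attn], axis=1
    )
    out = attn @ np.concatenate([v_vis, v_txt], axis=0)
    return out, attn


def guided_logits(logits_image, logits_null, gamma=1.0):
    """Assumed contrastive decoding rule (not taken from the paper).

    logits_image: next-token logits given the real image.
    logits_null:  next-token logits given the soft visual prompt instead.
    Tokens whose score depends on the image are amplified; tokens driven
    only by the text context are left unchanged.
    """
    return logits_image + gamma * (logits_image - logits_null)


# Small demonstration with random toy tensors.
rng = np.random.default_rng(0)
n_vis, n_txt, d = 4, 3, 8
q_txt = rng.normal(size=(n_txt, d))
k_vis, v_vis = rng.normal(size=(n_vis, d)), rng.normal(size=(n_vis, d))
k_txt, v_txt = rng.normal(size=(n_txt, d)), rng.normal(size=(n_txt, d))
out, attn = dual_attention(q_txt, k_vis, v_vis, k_txt, v_txt)
print("attention mass on visual tokens per query:", attn[:, :n_vis].sum(axis=1))

logits_image = np.array([2.0, 1.8, 0.1])  # token 0 is favored by the text prior
logits_null = np.array([2.0, 0.5, 0.1])   # token 1 loses support without the image
print("greedy token without guidance:", int(np.argmax(logits_image)))
print("greedy token with guidance:", int(np.argmax(guided_logits(logits_image, logits_null))))
```

In this toy setup the visual branch always receives a fixed share of each query's attention, so a strong local text pattern cannot drive the visual share toward zero. The guided logits favor the token whose score actually changes when the image is present.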

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Debiasing Multimodal Large Language Models

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding

AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality

Mitigating Multilingual Hallucination in Large Vision-Language Models

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Reducing Hallucinations in Vision-Language Models via Latent Space Steering

Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Calibrated Self-Rewarding Vision Language Models

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

InfMLLM: A Unified Framework for Visual-Language Tasks

Unified Generative and Discriminative Training for Multi-modal Large Language Models