Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang
2024-11-22
Abstract: Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at https://lacing-lvlm.github.io.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses language bias in large vision-language models (LVLMs). Although LVLMs achieve impressive results on a range of vision-language tasks, language bias often leads them to hallucinate: they pay less attention to the image and understand its content poorly. The paper attributes this bias to two main causes:

1. **Data scale gap between the pre-training stage and the multimodal alignment stage**: The LLM component of an LVLM is pre-trained on large-scale text corpora, whereas the multimodal alignment stage uses far fewer training samples and a much shorter training time. Because of this gap, the pre-training distribution dominates and the LVLM cannot make full use of visual inputs.
2. **Inference bias from short-term dependencies in text data**: In text, a word correlates strongly with its neighbors and only weakly with distant words. LLMs pre-trained on large text corpora pick up this short-term dependency and assign higher attention weights to adjacent tokens. That pattern can be inappropriate for multimodal inputs, where it leads LVLMs to rely too heavily on nearby text and to ignore the actual visual input.

To address these problems, the paper proposes the LACING framework, which has two core mechanisms:

1. **Multimodal Dual-Attention Mechanism (MDA)**: A parallel dual-attention mechanism computes the attention weights for visual inputs and for text inputs separately, then fuses the two sets of weights into the final attention map. This lets the LVLM attend fully to visual inputs in every layer while keeping causal attention over text inputs.
2. **Soft-Image Guidance (IFG)**: A learnable soft visual prompt replaces the visual input to construct a multimodal-null input. The soft visual prompt acts as a placeholder that keeps the input format consistent while compelling the model to prioritize text inputs; per the abstract, it is used during both training and inference. IFG also introduces a decoding strategy that uses the soft visual prompt to reduce the model's over-reliance on adjacent text inputs.

Together, these mechanisms reduce the language bias of LVLMs, improve visual comprehension, and reduce hallucinations without requiring additional training resources or data. According to the reported experiments, the method improves LVLM performance on multiple benchmarks.
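
The two mechanisms described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the summary does not specify the fusion rule or the decoding formula. The function names (`dual_attention_map`, `guided_logits`), the fixed fusion weight `alpha`, and the guidance scale `gamma` are all hypothetical. The decoding step assumes a classifier-free-guidance-style extrapolation between logits computed with the real image and logits computed with the soft visual prompt.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries set to -inf receive zero weight."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def dual_attention_map(q_text, k_vis, k_text, alpha=0.5):
    """Attention map for text queries over [visual keys, text keys].

    Assumption: the two branches are fused with a fixed weight `alpha`.
    The actual fusion rule is not given in the summary above.
    """
    d = q_text.shape[-1]
    n_t = q_text.shape[0]

    # Visual branch: every text query attends to all visual tokens (no mask).
    vis_attn = softmax(q_text @ k_vis.T / np.sqrt(d))

    # Text branch: causal mask, so query i only sees text keys 0..i.
    txt_scores = q_text @ k_text.T / np.sqrt(d)
    causal = np.tril(np.ones((n_t, n_t), dtype=bool))
    txt_attn = softmax(np.where(causal, txt_scores, -np.inf))

    # Fuse: each row still sums to 1, with mass `alpha` on visual tokens.
    return np.concatenate([alpha * vis_attn, (1.0 - alpha) * txt_attn], axis=-1)


def guided_logits(logits_image, logits_soft_prompt, gamma=1.5):
    """Guidance-style decoding (assumed form, not taken from the paper).

    Moves next-token logits away from the prediction obtained when the image
    is replaced by the soft visual prompt. gamma=1 gives the plain
    image-conditioned logits; gamma=0 gives the soft-prompt logits.
    """
    return logits_soft_prompt + gamma * (logits_image - logits_soft_prompt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))    # 4 text queries, head dimension 8
    k_v = rng.normal(size=(3, 8))  # 3 visual keys
    k_t = rng.normal(size=(4, 8))  # 4 text keys
    attn = dual_attention_map(q, k_v, k_t, alpha=0.3)
    print(attn.shape, attn.sum(axis=-1))
```

In this sketch the visual tokens always receive a fixed share `alpha` of each text query's attention, so attention to the image cannot be crowded out by adjacent text tokens. A learned or layer-dependent fusion would be an equally plausible reading of the description.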