VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

Jiaqi Wang, Yifei Gao, Jitao Sang
2024-11-24
Abstract: Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in multimodal task reasoning. However, they often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination. Recent approaches have introduced training-free methods that mitigate hallucinations by adjusting the decoding strategy during the inference stage, typically attributing hallucination to the language model itself. Our analysis, however, reveals that distortions in the visual encoding process significantly affect the model's reasoning accuracy. Specifically, earlier visual layers may retain key features but gradually distort as the information propagates toward the output layer. Building on these findings, we propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD). This method utilizes uncertainty to guide the selection of visual hidden layers, correcting distortions in the visual encoding process and thereby improving the reliability of generated text. Experimental results show that VaLiD effectively reduces hallucinations across various benchmarks, achieving state-of-the-art performance compared to multiple baseline methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses hallucination in the responses of large vision-language models (LVLMs). Although LVLMs perform well in multimodal reasoning tasks, they sometimes generate responses that seem reasonable but are inconsistent with the visual content, a phenomenon known as "hallucination". Most existing methods reduce hallucination by adjusting the decoding strategy and usually assume that hallucination originates mainly from the language model itself. The authors' analysis, however, finds that distortion in the visual encoding process also significantly affects the model's reasoning accuracy. They therefore propose a method that mitigates hallucination from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD).

### Main contributions

1. **Emphasizing the importance of the visual encoder**: The paper highlights the role of the visual encoder in the LVLM reasoning stage and examines LVLM hallucination from the perspective of visual encoding.
2. **Revealing the existence of visual information distortion**: Experiments show that multiple LVLMs suffer information distortion during visual encoding, and that this distortion contributes significantly to hallucination in the model responses.
3. **Proposing the training-free method VaLiD**: Uncertainty guides the construction of the reference distribution, and hallucinated responses are corrected through contrastive decoding.
4. **Extensive experimental verification**: The experimental results show that VaLiD effectively reduces hallucination, improves the reliability and accuracy of the generated text, and achieves state-of-the-art performance on multiple benchmarks.

### Method overview

#### 3.1 Visual encoding distortion leading to hallucination

- **Finding 1: Visual-layer encoding distortion leads to most prediction reversals**: The authors decode from the early visual layers and introduce the Encoding Distortion Rate (EDR) to quantify potential distortion in the visual encoding process. Many samples are decoded correctly from the early visual hidden layers but incorrectly from the standard output layer.
- **Finding 2: Uncertainty as an indicator of encoding distortion**: Uncertainty is used to track how visual information changes as it passes through the hidden layers of the visual encoder. Statistical analysis shows that hidden layers with incorrect decoding results have higher uncertainty, while layers with correct decoding have lower uncertainty.

#### 3.2 Visual-layer fusion with contrastive decoding

- **Uncertainty-guided visual-layer fusion**: The early high-uncertainty visual layers are used to improve output correction in contrastive decoding. To avoid over-relying on the unstable information of a single layer, a layer-fusion method dynamically selects the top k layers with the highest uncertainty at each step to construct a candidate layer set.
- **Contrastive decoding**: A new probability distribution is computed by contrasting the distribution obtained from the standard output layer features with the distribution obtained from the high-uncertainty layers. An adaptive reliability constraint prevents the new distribution from mistakenly penalizing reasonable outputs.

### Experimental results

- **POPE benchmark**: On the POPE benchmark, VaLiD significantly outperforms the other baseline methods in the random, popular, and adversarial settings, with especially stable improvements under popular and adversarial sampling.
- **AMBER benchmark**: On the AMBER benchmark, VaLiD performs well on questions about object actions and quantity attributes, and outperforms contrastive decoding methods that rely on data augmentation, such as VCD and Ritual.

### Conclusion

By approaching the problem from the visual encoding perspective, VaLiD alleviates hallucination in LVLMs and improves the reliability and accuracy of the generated text. The method achieves state-of-the-art performance on multiple benchmarks and offers a new perspective for future research.
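
The fusion and contrastive steps described in Section 3.2 can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not give the exact formulas, so the code assumes entropy as the uncertainty measure, simple averaging as the fusion rule, the common contrastive form `(1 + alpha) * standard - alpha * reference`, and a reliability constraint that keeps only tokens whose standard probability is at least `beta` times the maximum. The function names and the parameters `alpha`, `beta`, and `k` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits (-inf maps to 0)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy, used here as an assumed uncertainty measure."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_uncertain_layers(layer_logits, k):
    """Average the logits of the k layers with the highest entropy.

    layer_logits: one list of next-token logits per early visual layer,
    each obtained by decoding with that layer's visual features.
    Averaging is an assumption; the summary only says layers are fused.
    """
    ranked = sorted(
        layer_logits,
        key=lambda lg: entropy(softmax(lg)),
        reverse=True,
    )
    chosen = ranked[:k]
    vocab_size = len(chosen[0])
    return [sum(lg[i] for lg in chosen) / len(chosen) for i in range(vocab_size)]


def contrastive_decode(std_logits, ref_logits, alpha=1.0, beta=0.1):
    """Contrast standard logits against the fused reference logits.

    Tokens whose standard probability falls below beta * max probability
    are masked out, so the contrast cannot promote implausible tokens.
    """
    std_probs = softmax(std_logits)
    cutoff = beta * max(std_probs)
    scores = []
    for s, r, p in zip(std_logits, ref_logits, std_probs):
        if p >= cutoff:
            scores.append((1 + alpha) * s - alpha * r)
        else:
            scores.append(float("-inf"))
    return softmax(scores)


# Toy example with a three-token vocabulary (values are made up).
early_layer_logits = [
    [0.0, 0.0, 0.0],  # maximally uncertain layer
    [5.0, 0.0, 0.0],  # confident layer, not selected when k=2
    [1.0, 0.0, 0.0],  # moderately uncertain layer
]
std_logits = [2.0, 1.9, -3.0]  # output layer narrowly prefers token 0

fused = fuse_uncertain_layers(early_layer_logits, k=2)
probs = contrastive_decode(std_logits, fused, alpha=1.0, beta=0.1)
```

In this toy case the fused reference leans toward token 0, so the contrast shifts probability toward token 1, while token 2 stays excluded because the standard output layer already considers it implausible. In a real LVLM the per-layer logits would come from running the language model on features taken from different hidden layers of the visual encoder, which this sketch does not model.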