Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha
2024-10-12
Abstract: Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations. While hallucinations are well-studied, the exact causes behind them remain underexplored. In this paper, we first investigate the root causes of hallucinations in LVLMs. Our findings reveal that existing mitigation techniques primarily reduce hallucinations for visual recognition prompts (those that require simple descriptions of visual elements) but fail for cognitive prompts that demand deliberate reasoning. We identify the core issue as a lack of true visual perception in LVLMs: although they can accurately recognize visual elements, they struggle to fully interpret these elements in the context of the input prompt and effectively link this recognition to their internal knowledge, which is critical for reasoning. To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs. VDGD works by first generating a detailed description of the image and appending it as a prefix to the instruction. During response generation, tokens are sampled based on their KL divergence to the description, favoring candidates with lower divergence. Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines by 2% to 33%. Finally, we introduce VaLLu, a benchmark designed for comprehensive evaluation of the cognitive capabilities of LVLMs.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the problem of large vision-language models (LVLMs) generating responses that are inconsistent with factual information, a phenomenon known as "hallucination". Specifically, the authors find that existing hallucination-mitigation techniques mainly reduce hallucinations in visual recognition tasks and are not effective for cognitive tasks that require reasoning. This indicates that LVLMs have insufficient visual perception capabilities when handling complex reasoning tasks.

### Main problems and solutions in the paper

1. **Research background**:
   - LVLMs often produce outputs that are inconsistent with facts when generating responses; this phenomenon is called "hallucination".
   - Although hallucination has been widely studied, its root cause has not been fully clarified.
2. **Limitations of existing techniques**:
   - Existing hallucination-mitigation techniques mainly help with simple visual recognition tasks (such as describing objects in an image), but are not effective for cognitive tasks that require reasoning.
   - These techniques perform well on images of real-life scenes, but have limited effectiveness on non-real-life scenes or on tasks that require reasoning.
3. **Core problems**:
   - Through experiments, the authors find that although LVLMs can accurately identify visual elements in an image, they have difficulty relating these elements to the input prompt and combining them with internal knowledge for reasoning.
   - This visual perception gap leads to hallucinations in cognitive tasks, which in turn degrades the model's reasoning ability.
4. **Proposed solutions**:
   - To close this gap, the authors propose **Visual Description Grounded Decoding (VDGD)**, a simple, robust, and training-free method.
   - VDGD first generates a detailed description of the image and attaches it as a prefix to the instruction. During response generation, candidate tokens are scored by their KL divergence to the description, and candidates with lower divergence (that is, closer to the description) are favored.
5. **Experimental results**:
   - VDGD outperforms existing baseline methods on multiple visual reasoning benchmarks, with improvements ranging from 2% to 33%.
   - To comprehensively evaluate the cognitive abilities of LVLMs, the authors also introduce a new benchmark, VaLLu, which contains 1,500 carefully selected instances and focuses on questions that require open-ended answers.

### Summary

This paper analyzes hallucinations in LVLM responses, shows where existing hallucination-mitigation techniques fall short, and proposes a visual-description-grounded decoding method to improve the model's reasoning ability. The experimental results show that VDGD performs well on multiple benchmarks, reducing hallucinations and enhancing the cognitive abilities of LVLMs.
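
The decoding step described for VDGD (score candidate tokens by their KL divergence to the image description, and favor lower divergence) can be illustrated with a small toy sketch. This is not the authors' implementation; it only makes the idea concrete under several assumptions. The function names (`vdgd_step`, `plausible_candidates`), the `alpha` cutoff used to pick candidates, the use of a one-hot distribution for each candidate, and taking the minimum divergence over description positions are all illustrative choices that the text above does not specify. In a real LVLM, `next_logits` and `description_dists` would come from the model; here they are hand-written lists.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def plausible_candidates(probs, alpha=0.1):
    """Assumed candidate filter: keep tokens with prob >= alpha * max prob."""
    cutoff = alpha * max(probs)
    return [i for i, p in enumerate(probs) if p >= cutoff]


def kl_onehot_to_dist(token_id, dist, eps=1e-12):
    """KL(one_hot(token_id) || dist), which reduces to -log dist[token_id]."""
    return -math.log(dist[token_id] + eps)


def vdgd_step(next_logits, description_dists, alpha=0.1):
    """Re-weight candidate next tokens by closeness to the description.

    next_logits: model logits for the next response token (hypothetical input).
    description_dists: one probability distribution over the vocabulary per
        description token position (hypothetical input).
    Returns a dict {token_id: probability} over the retained candidates.
    """
    probs = softmax(next_logits)
    candidates = plausible_candidates(probs, alpha)
    # Assumption: a candidate's divergence is its smallest KL to any
    # description position.
    divergence = {
        t: min(kl_onehot_to_dist(t, d) for d in description_dists)
        for t in candidates
    }
    # Lower divergence gives higher probability: softmax over negative KL.
    weights = softmax([-divergence[t] for t in candidates])
    return dict(zip(candidates, weights))


# Toy vocabulary of 4 tokens; token 0 is strongly supported by the description.
next_logits = [2.0, 2.0, 0.0, -5.0]
description_dists = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.05, 0.30, 0.05],
]
reweighted = vdgd_step(next_logits, description_dists)
best_token = max(reweighted, key=reweighted.get)
```

In this toy run, tokens 0 and 1 are equally likely under the raw logits, but the re-weighting prefers token 0 because the description assigns it much higher probability. Token 3 is dropped by the candidate filter before any divergence is computed.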