V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Yuxi Xie, Guanzhen Li, Xiao Xu, Min-Yen Kan
2024-11-05
Abstract: Large vision-language models (LVLMs) suffer from hallucination, resulting in misalignment between the output textual response and the input visual content. Recent research indicates that the over-reliance on the Large Language Model (LLM) backbone, as one cause of the LVLM hallucination, inherently introduces bias from language priors, leading to insufficient context attention to the visual inputs. We tackle this issue of hallucination by mitigating such over-reliance through preference learning. We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time. To interpret the effectiveness and generalizability of V-DPO on different types of training data, we construct a synthetic dataset containing both response- and image-contrast preference pairs, compared against existing human-annotated hallucination samples. Our approach achieves significant improvements compared with baseline methods across various hallucination benchmarks. Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context. Our code is publicly available at https://github.com/YuxiXie/V-DPO.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the hallucination problem in large vision-language models (LVLMs). Hallucination refers to cases where the text generated by the model does not match the input visual content, producing inaccurate or entirely wrong descriptions. The problem is particularly evident on unconventional images.

#### Reasons for the Hallucination Problem

1. **Over-reliance on large language models (LLMs)**: Existing LVLMs usually integrate a pre-trained LLM as their backbone. This integration can cause the model to rely too heavily on language patterns and ignore the context provided by the visual input.
2. **Bias from language priors**: Because LLMs have strong language generation capabilities, they can introduce bias from language priors, making the model more inclined to follow familiar language patterns than to generate text grounded in the actual visual input.
3. **Insufficient context attention**: Existing models do not attend sufficiently to the visual context when processing the visual and text modalities, so the generated content is poorly aligned with the visual input.

#### Solutions

To address these problems, the authors propose **Vision-guided Direct Preference Optimization (V-DPO)**. V-DPO strengthens visual understanding in the following ways to mitigate hallucination:

1. **Vision-guided preference optimization**: V-DPO adds vision guidance to preference learning, so that the model attends more to the visual context when generating text and depends less on language priors.
2. **Construction of contrastive data**: To evaluate the effectiveness and generalizability of V-DPO, the authors construct a synthetic dataset containing both response-contrast and image-contrast preference pairs and compare it against existing human-annotated hallucination samples.
3. **Classifier-Free Guidance (CFG)**: V-DPO incorporates CFG to make the generated content more sensitive to the specific visual input.
4. **Optimization objective**: V-DPO adds a vision-guided term to the optimization objective to strengthen the model's attention to the visual context. The objective is:

\[ \max_{\pi} \mathbb{E}_{(v,x) \sim I \times P,\, y \sim \pi} \left[ r(v, x, y) - \beta D_{\mathrm{KL}}\left[\pi(y|v, x) \,\|\, \pi_{\mathrm{ref}}(y|v, x)\right] + \alpha D_{\mathrm{KL}}\left[\pi(y|v, x) \,\|\, \pi(y|x)\right] \right] \]

where \(r(v, x, y)\) is the reward function, \(\pi_{\mathrm{ref}}\) is the reference model, \(\beta\) weights the KL penalty that keeps the policy close to the reference model, and \(\alpha\) controls the weight of the vision-guided term. The last term rewards divergence between the image-conditioned distribution \(\pi(y|v, x)\) and the text-only distribution \(\pi(y|x)\).

With these components, V-DPO significantly improves LVLM performance across various hallucination benchmarks, showing stronger visual understanding and generation capabilities in particular on unconventional images.
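To make the roles of the three terms in the objective above concrete, the following minimal sketch evaluates it for a single image-prompt pair \((v, x)\) over a small discrete set of candidate responses. It is an illustration only, not the authors' training code: the function names, the three-response toy distributions, and the reward values are all hypothetical, and the real method optimizes a neural policy rather than enumerating responses.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def vision_guided_objective(policy_vx, policy_x, ref_vx, rewards, beta, alpha):
    """Evaluate the objective for one (v, x) pair over enumerated responses.

    policy_vx: pi(y | v, x), the policy conditioned on image and prompt
    policy_x:  pi(y | x), the same policy without the image
    ref_vx:    pi_ref(y | v, x), the reference model
    rewards:   r(v, x, y) for each candidate response y
    """
    # Expected reward under the image-conditioned policy.
    expected_reward = sum(p * r for p, r in zip(policy_vx, rewards))
    # Penalty for drifting away from the reference model.
    kl_to_ref = kl_divergence(policy_vx, ref_vx)
    # Bonus for differing from the text-only (language prior) distribution.
    kl_to_text_only = kl_divergence(policy_vx, policy_x)
    return expected_reward - beta * kl_to_ref + alpha * kl_to_text_only


# Hypothetical toy values: three candidate responses for one image and prompt.
policy_vx = [0.7, 0.2, 0.1]
policy_x = [0.3, 0.4, 0.3]
ref_vx = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, -1.0]

without_guidance = vision_guided_objective(
    policy_vx, policy_x, ref_vx, rewards, beta=0.1, alpha=0.0
)
with_guidance = vision_guided_objective(
    policy_vx, policy_x, ref_vx, rewards, beta=0.1, alpha=0.5
)
print(without_guidance, with_guidance)
```

With \(\alpha = 0\) the expression reduces to the usual KL-regularized reward objective. A positive \(\alpha\) raises the objective whenever the image-conditioned distribution differs from the text-only one, which is how the extra term discourages the policy from falling back on language priors.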