Abstract: Large vision-language models (LVLMs) suffer from hallucination, resulting in misalignment between the output textual response and the input visual content. Recent research indicates that the over-reliance on the Large Language Model (LLM) backbone, as one cause of the LVLM hallucination, inherently introduces bias from language priors, leading to insufficient context attention to the visual inputs. We tackle this issue of hallucination by mitigating such over-reliance through preference learning. We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time. To interpret the effectiveness and generalizability of V-DPO on different types of training data, we construct a synthetic dataset containing both response- and image-contrast preference pairs, compared against existing human-annotated hallucination samples. Our approach achieves significant improvements compared with baseline methods across various hallucination benchmarks. Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context. Our code is publicly available at https://github.com/YuxiXie/V-DPO.

### What problem does this paper attempt to solve?

This paper addresses the hallucination problem in large vision-language models (LVLMs). Hallucination here refers to generated text that does not match the input visual content, producing inaccurate or entirely wrong descriptions. The problem is particularly evident on unconventional images.

#### Reasons for the Hallucination Problem

1. **Over-reliance on large language models (LLMs)**: Existing LVLMs usually use a pre-trained LLM as their backbone. This integration can cause the model to rely too heavily on language patterns and ignore the context provided by the visual input.
2. **Bias from language priors**: Because LLMs have strong language generation capabilities, they can introduce bias from language priors, so the model tends to follow familiar language patterns rather than the actual visual input when generating text.
3. **Insufficient context attention**: Existing models do not attend sufficiently to the visual context when processing the visual and text modalities, so the generated content is poorly aligned with the visual input.

#### Solutions

To address these problems, the authors propose **Vision-guided Direct Preference Optimization (V-DPO)**. V-DPO strengthens visual understanding in the following ways to alleviate hallucination:

1. **Vision-guided preference optimization**: V-DPO uses vision guidance to enhance preference learning, so that the model pays more attention to the visual context when generating text and depends less on language priors.
2. **Construction of contrastive data**: To evaluate the effectiveness and generalization ability of V-DPO, the authors construct a synthetic dataset containing response-contrast and image-contrast preference pairs and compare it against existing human-annotated hallucination samples.
3. **Classifier-Free Guidance (CFG)**: V-DPO incorporates Classifier-Free Guidance (CFG) to make the generated content more sensitive to the specific visual input.
4. **Optimization objective**: V-DPO adds a vision-guided term to the optimization objective to increase the model's attention to the visual context. The objective is:

\[ \max_{\pi} \mathbb{E}_{(v,x) \sim I \times P,\, y \sim \pi} \left[ r(v, x, y) - \beta D_{\mathrm{KL}}\left[\pi(y|v, x) \,\|\, \pi_{\mathrm{ref}}(y|v, x)\right] + \alpha D_{\mathrm{KL}}\left[\pi(y|v, x) \,\|\, \pi(y|x)\right] \right] \]

where \(r(v, x, y)\) is the reward function, \(\pi_{\mathrm{ref}}\) is the reference model, \(\beta\) weights the KL penalty against the reference model, and \(\alpha\) controls the weight of the vision-guided term.

With these components, V-DPO significantly improves the performance of LVLMs on various hallucination benchmarks, and shows especially strong visual understanding and generation on unconventional images.
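To make the three terms of the optimization objective concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a finite set of candidate responses, so that the expectation and both KL divergences are exact sums over probability lists. The function names (`kl_divergence`, `vision_guided_objective`) and all numbers are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def vision_guided_objective(policy_vx, policy_x, ref_vx, rewards, beta, alpha):
    """Value of the objective for one (image, prompt) pair.

    policy_vx: pi(y | v, x), the policy conditioned on image and prompt
    policy_x:  pi(y | x), the same policy without the image
    ref_vx:    pi_ref(y | v, x), the reference model
    rewards:   r(v, x, y) for each candidate response y
    """
    expected_reward = sum(p * r for p, r in zip(policy_vx, rewards))
    kl_to_reference = kl_divergence(policy_vx, ref_vx)   # penalized with beta
    kl_to_text_only = kl_divergence(policy_vx, policy_x)  # rewarded with alpha
    return expected_reward - beta * kl_to_reference + alpha * kl_to_text_only


# Hypothetical toy distributions over three candidate responses.
policy_vx = [0.7, 0.2, 0.1]
policy_x = [0.4, 0.4, 0.2]
ref_vx = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, -1.0]

value = vision_guided_objective(
    policy_vx, policy_x, ref_vx, rewards, beta=0.1, alpha=0.1
)
print(value)
```

In this sketch, a larger \(\alpha\) rewards policies whose image-conditioned distribution diverges from the text-only one, while a larger \(\beta\) keeps the policy close to the reference model.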

V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

Evaluating Object Hallucination in Large Vision-Language Models

Mitigating Multilingual Hallucination in Large Vision-Language Models

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning

Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding

Prescribing the Right Remedy: Mitigating Hallucinations in Large Vision-Language Models via Targeted Instruction Tuning

Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence

Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding

HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding