Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen

2024-05-24

Abstract: Although Large Visual Language Models (LVLMs) have demonstrated exceptional abilities in understanding multimodal data, they invariably suffer from hallucinations, leading to a disconnect between the generated text and the corresponding images. Almost all current visual contrastive decoding methods attempt to mitigate these hallucinations by introducing visual uncertainty information that appropriately widens the contrastive logits gap between hallucinatory and targeted tokens. However, due to the uncontrollable nature of global visual uncertainty, they struggle to precisely induce the hallucinatory tokens, which severely limits their effectiveness in mitigating hallucinations and may even lead to the generation of undesired hallucinations. To tackle this issue, we conduct a theoretical analysis to promote the effectiveness of contrastive decoding. Building on this insight, we introduce a novel optimization strategy named Hallucination-Induced Optimization (HIO). This strategy seeks to amplify the contrast between hallucinatory and targeted tokens by relying on a fine-tuned theoretical preference model (i.e., the Contrary Bradley-Terry Model), thereby facilitating efficient contrastive decoding to alleviate hallucinations in LVLMs. Extensive experiments demonstrate that our HIO strategy can effectively reduce hallucinations in LVLMs, outperforming state-of-the-art methods across various benchmarks.
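The contrastive decoding step the abstract refers to can be illustrated with a minimal sketch. It assumes the combination rule commonly used in visual contrastive decoding, (1 + alpha) * target_logits - alpha * induced_logits; the abstract does not state the exact rule HIO uses, so the formula, the function names, the toy vocabulary, and the logit values below are all illustrative assumptions.

```python
def contrastive_logits(target_logits, induced_logits, alpha=1.0):
    """Combine two logit vectors over the same vocabulary.

    Assumed rule (common in contrastive decoding, not confirmed by the
    abstract): (1 + alpha) * target - alpha * induced.
    Tokens that the hallucination-prone branch scores highly are pushed down.
    """
    if len(target_logits) != len(induced_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [(1 + alpha) * t - alpha * h
            for t, h in zip(target_logits, induced_logits)]


def greedy_token(logits):
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical 3-token vocabulary: ["dog", "cat", "frisbee"]; the image shows a dog.
target = [2.0, 2.1, 0.5]    # original model narrowly prefers the hallucinated "cat"
induced = [0.5, 3.0, 0.4]   # hallucination-inducing branch strongly prefers "cat"

adjusted = contrastive_logits(target, induced, alpha=1.0)
plain_choice = greedy_token(target)        # index 1 ("cat")
contrast_choice = greedy_token(adjusted)   # index 0 ("dog")
```

The wider the gap between the two branches on the hallucinated token, the more strongly that token is suppressed, which is why the abstract stresses inducing the hallucinatory tokens precisely.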

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper focuses on the hallucination problem in large vision-language models (LVLMs), where the generated text does not match the content of the corresponding image. Existing visual contrastive decoding methods try to alleviate hallucinations by introducing visual uncertainty information, but because global visual uncertainty is uncontrollable, these methods have difficulty inducing the hallucinatory tokens precisely. This limits their effectiveness and can even lead to the generation of undesired hallucinations. To address this issue, the paper conducts a theoretical analysis aimed at improving the effectiveness of contrastive decoding and, based on it, proposes a new strategy called Hallucination-Induced Optimization (HIO). HIO relies on a fine-tuned theoretical preference model, the Contrary Bradley-Terry Model, to amplify the contrast between hallucinatory tokens and target tokens, thereby enabling efficient contrastive decoding and reducing hallucinations in LVLMs. The experiments reported in the paper show that HIO effectively reduces hallucinations in LVLMs and outperforms existing methods on multiple benchmarks. In summary, the paper addresses the mismatch between generated text and image content, referred to as hallucination, in LVLMs that understand and generate multimodal data, and proposes the HIO optimization strategy to reduce it.
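For readers unfamiliar with the preference model named above, the following sketch shows the standard Bradley-Terry pairwise probability and loss, plus one possible reading of a "contrary" objective in which the hallucinatory token is treated as the preferred item. The reversed-role loss, the function names, and the scores are assumptions for illustration; the text here does not give the paper's actual formulation.

```python
import math


def bt_loss(score_preferred, score_rejected):
    """Standard Bradley-Terry negative log-likelihood.

    P(preferred beats rejected) = sigmoid(score_preferred - score_rejected),
    so the loss is softplus(-(score_preferred - score_rejected)),
    computed here in a numerically stable form.
    """
    d = score_preferred - score_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def bt_prob(score_preferred, score_rejected):
    """Probability that the preferred item wins under Bradley-Terry."""
    return math.exp(-bt_loss(score_preferred, score_rejected))


def contrary_bt_loss(score_target, score_hallucinated):
    """Assumed 'contrary' variant: roles are swapped, so the loss is low
    when the hallucinatory token outscores the target token."""
    return bt_loss(score_hallucinated, score_target)


# Hypothetical scores: minimizing the contrary loss rewards the branch
# for ranking the hallucinatory token above the target token.
low = contrary_bt_loss(score_target=0.0, score_hallucinated=2.0)
high = contrary_bt_loss(score_target=2.0, score_hallucinated=0.0)
```

Under this reading, a model fine-tuned with the reversed objective would assign higher scores to hallucinatory tokens, giving contrastive decoding a sharper signal to subtract.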

Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models

Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding

V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding

Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization

HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models

Alleviating Hallucinations of Large Language Models through Induced Hallucinations

Reducing Hallucinations in Vision-Language Models via Latent Space Steering