Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen
2024-05-24
Abstract: Although Large Visual Language Models (LVLMs) have demonstrated exceptional abilities in understanding multimodal data, they invariably suffer from hallucinations, leading to a disconnect between the generated text and the corresponding images. Almost all current visual contrastive decoding methods attempt to mitigate these hallucinations by introducing visual uncertainty information that appropriately widens the contrastive logits gap between hallucinatory and targeted tokens. However, due to the uncontrollable nature of the global visual uncertainty, they struggle to precisely induce the hallucinatory tokens, which severely limits their effectiveness in mitigating hallucinations and may even lead to the generation of undesired hallucinations. To tackle this issue, we conducted a theoretical analysis of how to improve the effectiveness of contrastive decoding. Building on this insight, we introduce a novel optimization strategy named Hallucination-Induced Optimization (HIO). This strategy seeks to amplify the contrast between hallucinatory and targeted tokens by relying on a fine-tuned theoretical preference model (i.e., the Contrary Bradley-Terry Model), thereby facilitating efficient contrastive decoding to alleviate hallucinations in LVLMs. Extensive experiments demonstrate that our HIO strategy can effectively reduce hallucinations in LVLMs, outperforming state-of-the-art methods across various benchmarks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses hallucination in large vision-language models (LVLMs), where the generated text does not match the content of the corresponding image. Existing visual contrastive decoding methods try to mitigate hallucinations by introducing visual uncertainty information, which widens the logit gap between hallucinatory and target tokens. Because this global visual uncertainty is uncontrollable, however, these methods struggle to induce the hallucinatory tokens precisely, which limits their effectiveness and may even produce new, undesired hallucinations. To address this, the paper first presents a theoretical analysis of how to make contrastive decoding more effective and then proposes a new strategy called Hallucination-Induced Optimization (HIO). HIO relies on a fine-tuned theoretical preference model, the Contrary Bradley-Terry Model, to amplify the contrast between hallucinatory and target tokens, thereby enabling efficient contrastive decoding. According to the authors, experiments show that HIO reduces hallucinations in LVLMs and outperforms state-of-the-art methods across various benchmarks.
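
To make the contrastive decoding idea concrete, here is a minimal sketch of one commonly used form of the logit adjustment, (1 + alpha) * base - alpha * induced. This page does not give the paper's exact formulation, so the formula, the function names, the `alpha` parameter, the toy vocabulary, and the hard-coded logits below are all illustrative assumptions rather than the authors' implementation. In an HIO-style setup, the "induced" logits would come from a model tuned to favor hallucinatory tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(base_logits, induced_logits, alpha=1.0):
    """Assumed contrastive form: boost the base model's logits and subtract
    the logits of a model that is biased toward hallucinatory tokens."""
    return [(1.0 + alpha) * b - alpha * h
            for b, h in zip(base_logits, induced_logits)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy example (hypothetical numbers): the image shows a dog,
# and "cat" is the hallucinatory token.
vocab = ["dog", "cat", "frisbee"]
base_logits = [2.0, 2.1, 0.5]      # base model slightly prefers "cat"
induced_logits = [0.5, 3.0, 0.4]   # induced model strongly prefers "cat"

adjusted = contrastive_logits(base_logits, induced_logits, alpha=1.0)
best_before = vocab[argmax(base_logits)]
best_after = vocab[argmax(adjusted)]
probs_after = softmax(adjusted)

print(best_before, "->", best_after)
```

The more confidently the induced model favors the hallucinatory token, the more that token is suppressed after subtraction, which is why precisely inducing hallucinatory tokens matters for this family of methods.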