Abstract: Large Vision-Language Models (LVLMs) are increasingly adept at generating contextually detailed and coherent responses from visual inputs. However, their application in multimodal decision-making and open-ended generation is hindered by a notable rate of hallucinations, where generated text inaccurately represents the visual contents. To address this issue, this paper introduces the Instruction Contrastive Decoding (ICD) method, a novel approach designed to reduce hallucinations during LVLM inference. Our method is inspired by our observation that what we call disturbance instructions significantly exacerbate hallucinations in multimodal fusion modules. ICD contrasts distributions from standard and instruction disturbance, thereby increasing alignment uncertainty and effectively subtracting hallucinated concepts from the original distribution. Through comprehensive experiments on discriminative benchmarks (POPE and MME) and a generative benchmark (LLaVa-Bench), we demonstrate that ICD significantly mitigates both object-level and attribute-level hallucinations. Moreover, our method not only addresses hallucinations but also significantly enhances the general perception and recognition capabilities of LVLMs.

What problem does this paper attempt to address?

The paper addresses hallucination in large vision-language models (LVLMs) during multimodal decision-making and open-ended generation. Although LVLMs can generate detailed, coherent responses from visual inputs, the generated text often misrepresents the visual content, producing object-level and attribute-level hallucinations. The paper introduces Instruction Contrastive Decoding (ICD), a method for reducing hallucinations during LVLM inference. ICD contrasts the output distributions obtained under standard instructions and under perturbed instructions. Because the perturbed instructions increase alignment uncertainty and amplify hallucinations, the contrast effectively subtracts hallucinated concepts from the original distribution.

### Main contributions of the paper:

1. **In-depth analysis of how perturbed instructions exacerbate hallucinations**: The paper examines the root causes of hallucinations from the perspectives of statistical bias and language priors.
2. **Introduction of the ICD method**: A novel strategy that first amplifies hallucinations and then removes them, adjusting the output distribution during inference.
3. **Verification of the effectiveness of the ICD method through extensive experiments**: The paper reports that ICD significantly reduces hallucinations on multiple benchmarks and consistently improves performance on 14 general perception and recognition tasks.

### Method overview:

- **LVLM inference process**: LVLMs consist of a visual encoder, a fusion module, and a language model. The visual encoder extracts visual features, the fusion module performs multimodal alignment, and the language model generates the text response.
- **Instruction perturbation**: Prepending a role prefix (positive or negative) to the original instruction increases the uncertainty of multimodal alignment and thereby exacerbates hallucinations.
- **Instruction Contrastive Decoding (ICD)**: ICD compares the probability distributions under the standard instruction and the perturbed instruction, and selects tokens that have high probability under the standard instruction and low probability under the perturbed one. ICD also applies an adaptive plausibility constraint to further refine the contrastive distribution.

### Experimental results:

- **POPE benchmark**: ICD significantly improves accuracy, precision, recall, and F1-score on three subsets: MSCOCO, A-OKVQA, and GQA.
- **MME benchmark**: ICD performs strongly on object-level and attribute-level hallucination tasks, significantly outperforming the baseline model and the VCD method.
- **LLaVa-Bench benchmark**: Case studies qualitatively evaluate the effectiveness of ICD in open-ended generation.

In conclusion, the paper proposes an effective method for reducing hallucinations in LVLMs that improves model accuracy and also enhances overall performance on multimodal tasks.
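
The contrastive step described in the method overview can be illustrated with a minimal sketch. It assumes the form commonly used in contrastive decoding: the score of each token is `(1 + lam) * standard_logit - lam * perturbed_logit`, and only tokens whose standard probability is at least `beta` times the top probability are eligible. The function names, the parameters `lam` and `beta`, and the toy logits are hypothetical and chosen for illustration; the paper's exact formulation and hyperparameters may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def icd_next_token(std_logits, pert_logits, lam=1.0, beta=0.1):
    """Return the index of the token chosen by a contrastive step.

    std_logits:  next-token logits under the standard instruction.
    pert_logits: next-token logits under the perturbed instruction.
    lam:         contrast strength (assumed hyperparameter).
    beta:        plausibility cutoff relative to the top standard
                 probability (assumed hyperparameter).
    """
    p_std = softmax(std_logits)
    cutoff = beta * max(p_std)
    best_idx, best_score = None, -math.inf
    for i, (s, d) in enumerate(zip(std_logits, pert_logits)):
        # Plausibility constraint: skip tokens the standard
        # distribution itself considers very unlikely.
        if p_std[i] < cutoff:
            continue
        # Reward standard evidence, penalize what the perturbed
        # instruction makes more likely.
        score = (1 + lam) * s - lam * d
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


# Toy vocabulary: the perturbed instruction inflates "dog",
# so the contrast shifts the choice to "frisbee".
vocab = ["dog", "cat", "frisbee"]
std = [2.0, 1.0, 1.9]
pert = [2.5, 0.5, 1.0]
chosen = vocab[icd_next_token(std, pert)]
```

In this toy case greedy decoding on the standard logits alone would pick "dog", while the contrastive score favors "frisbee" because the perturbed instruction raises the logit of "dog" the most. The cutoff matters because subtraction alone can promote tokens that are implausible under the standard instruction but happen to score very low under the perturbed one.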

Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding

Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding

Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models

Mitigating Hallucination Issues in Small-Parameter LLMs Through Inter-Layer Contrastive Decoding

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding

Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding

Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding

Alleviating Hallucinations of Large Language Models through Induced Hallucinations

Mitigating Multilingual Hallucination in Large Vision-Language Models

Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning

Reducing Hallucinations in Vision-Language Models via Latent Space Steering

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model