Abstract: While Large Vision-Language Models (LVLMs) have advanced rapidly in recent years, the prevalent "hallucination" problem has emerged as a significant bottleneck that hinders their real-world deployment. Existing methods mitigate this issue mainly from two perspectives. One approach leverages extra knowledge, such as robust instruction tuning of LVLMs with curated datasets or employing auxiliary analysis networks, which inevitably incurs additional costs. Another approach, known as contrastive decoding, induces hallucinations by manually disturbing the raw vision or instruction inputs and mitigates them by contrasting the outputs of the disturbed and original LVLMs. However, these approaches rely on empirical, holistic input disturbances and double the inference cost. To avoid these issues, we propose a simple yet effective method named Self-Introspective Decoding (SID). Our empirical investigation reveals that pretrained LVLMs can introspectively assess the importance of vision tokens based on preceding vision and text (both instruction and generated) tokens. We develop the Context and Text-aware Token Selection (CT2S) strategy, which preserves only unimportant vision tokens after the early layers of LVLMs to adaptively amplify text-informed hallucination during auto-regressive decoding. This approach ensures that the multimodal knowledge absorbed in the early layers induces multimodal contextual rather than aimless hallucinations. Subsequently, the amplified vision-and-text association hallucinations are subtracted from the original token logits, guiding LVLMs to decode faithfully. Extensive experiments illustrate that SID generates less-hallucinated and higher-quality text across various metrics, without extra knowledge or much additional computational burden.
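
The two steps described in the abstract (keeping only the unimportant vision tokens, then subtracting the hallucination-amplified logits from the original ones) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the per-token `importance` scores, the `keep` count, and the weight `alpha` are all hypothetical, and the `(1 + alpha) * original - alpha * amplified` form is borrowed from common contrastive-decoding formulations rather than confirmed by the abstract.

```python
from typing import List


def select_least_important(importance: List[float], keep: int) -> List[int]:
    """Return indices of the `keep` lowest-scoring vision tokens, in original order.

    `importance` is a hypothetical per-token score (e.g. derived from attention);
    the abstract does not specify how it is computed.
    """
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    return sorted(ranked[:keep])


def contrast_logits(original: List[float], amplified: List[float],
                    alpha: float = 1.0) -> List[float]:
    """Subtract hallucination-amplified logits from the original logits.

    Assumed form: (1 + alpha) * original - alpha * amplified.
    """
    if len(original) != len(amplified):
        raise ValueError("logit vectors must have the same length")
    return [(1 + alpha) * o - alpha * a for o, a in zip(original, amplified)]


def argmax(values: List[float]) -> int:
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy example: four vision tokens, three vocabulary entries.
kept = select_least_important([0.9, 0.1, 0.5, 0.05], keep=2)
original = [2.0, 1.9, 0.1]    # logits from the full input
amplified = [2.5, 0.5, 0.1]   # logits when only `kept` vision tokens remain
contrasted = contrast_logits(original, amplified, alpha=1.0)
```

In this toy case the token favored by the hallucination-amplified pass (index 0) is penalized, so the greedy choice moves from index 0 to index 1 after contrasting. A real decoder would apply this at every auto-regressive step and typically pass the result through a softmax before sampling.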

CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding

Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding

HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding

Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models

Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

Mitigating Object Hallucination via Concentric Causal Attention

Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens

IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding

Reducing Hallucinations in Vision-Language Models via Latent Space Steering

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs