Abstract: Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs oblivious to the accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may equally advocate both accurate and erroneous content. To address this issue, we propose Pensieve, a training-free method that leverages the analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics, to mitigate hallucination. Specifically, Pensieve enables MLLMs to retrospect relevant images as references and compare their visual content with the test image via confidence score subtraction. Moreover, our paradigm balances the effects of addressing errors from both the visual and textual branches by adaptively scaling the subtracted scores. Experiments on Whoops, LLaVA Bench, POPE, and MME demonstrate the efficacy of Pensieve in mitigating visual hallucination, surpassing other advanced decoding strategies. Pensieve also aids MLLMs in identifying visual details and enhances the specificity of generated image descriptions.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses visual hallucination in multimodal large language models (MLLMs) during vision-language tasks. Visual hallucination occurs when an MLLM generates a description that does not match the provided image: the output may conflict with the image, fabricate details, or overlook critical visual elements.

### Background and Motivation

Despite their impressive performance on various vision-language tasks, MLLMs are prone to visual hallucination. It can manifest as incorrect descriptions of image content, such as errors in object location, activities, or attributes. The authors' analysis shows that MLLMs do not completely ignore correct visual cues when they hallucinate; instead, the visual branch lends support to both correct and incorrect content. Based on this finding, they propose a new method, Pensieve, to mitigate visual hallucination.

### Method Overview

Pensieve is a training-free method with two key components:

1. **Reviewing Visual Concepts**: Construct a reference database of diverse visual concepts for the MLLM to review. Specifically, samples are selected from the COCO Caption dataset, which covers a wide range of everyday visual content. The reference images should induce visual hallucinations similar to those triggered by the test image, so that these analogous hallucinations can be used to cancel out the errors.
2. **Contrasting Visual Concepts**: Distinguish accurate visual cues by comparing the confidence scores (logits) obtained from the test image with those obtained from the reference images. The steps are:
   - Retrieve k reference images similar to the test image from the reference database.
   - For each candidate word, generate k+2 different predictions (one from the test image, one from the diffused image, and k from the retrieved reference images).
   - Highlight accurate candidate words through score subtraction, which reduces the influence of hallucinations shared with the similar images.

### Experimental Results

The authors validated Pensieve on multiple benchmarks, including Whoops, LLaVA Bench, MME, and POPE. Experimental results show that Pensieve significantly outperforms other advanced decoding strategies, such as VCD and DoLa, in reducing visual hallucination. Specifically, on the Whoops dataset, Pensieve increased the FaithScore of LLaVA1.5 by 0.4 and improved the overall score of InstructBLIP by 55. Additionally, Pensieve helps MLLMs recognize details in images and generate more specific image descriptions.

### Main Contributions

1. **Empirical Study**: Reveals that MLLMs do not completely ignore correct visual cues when they hallucinate, but support both correct and incorrect content to some extent.
2. **New Method**: Proposes Pensieve, a training-free method that mitigates visual hallucination by reviewing similar images and contrasting confidence scores.
3. **Experimental Validation**: Demonstrates the effectiveness of Pensieve on multiple benchmarks, particularly in reducing visual hallucination and improving the specificity of image descriptions.
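The retrieve-then-subtract steps in the Method Overview can be illustrated with a small Python sketch. It is not the authors' implementation. The function names (`retrieve_references`, `contrast_logits`), the cosine-similarity retrieval, the averaging over references, and the fixed weights `alpha` and `beta` are all assumptions made for illustration. The summary states that Pensieve scales the subtracted scores adaptively, and this sketch does not model that. It also assumes at least one reference image is retrieved (k ≥ 1).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_references(test_feat, db_feats, k):
    """Return indices of the k database features most similar to test_feat."""
    ranked = sorted(range(len(db_feats)),
                    key=lambda i: cosine(test_feat, db_feats[i]),
                    reverse=True)
    return ranked[:k]


def contrast_logits(test, diffused, refs, alpha=1.0, beta=0.1):
    """Subtract reference and diffused-image scores from the test-image scores.

    test, diffused: per-candidate logits (lists of floats).
    refs: k lists of logits, one per retrieved reference image (k >= 1).
    alpha, beta: hypothetical fixed weights (assumption, not from the paper).
    """
    k = len(refs)
    out = []
    for j, score in enumerate(test):
        ref_mean = sum(r[j] for r in refs) / k
        out.append((1 + alpha + beta) * score
                   - alpha * ref_mean
                   - beta * diffused[j])
    return out


# Toy example with three candidate words: ["dog", "cat", "frisbee"].
db = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]          # made-up image features
idx = retrieve_references([1.0, 0.05], db, k=2)     # two nearest references

test_logits = [2.0, 2.0, 1.0]        # test image supports "dog" and "cat" equally
diffused_logits = [1.0, 1.0, 1.0]    # diffused image carries no visual preference
ref_logits = [[0.5, 2.0, 0.5], [0.5, 2.0, 0.5]]  # references also favor "cat"
adjusted = contrast_logits(test_logits, diffused_logits, ref_logits)
best = max(range(len(adjusted)), key=adjusted.__getitem__)
```

In this toy setup the test image scores "dog" and "cat" equally, while the similar reference images also score "cat" highly. Subtracting the reference scores therefore lowers "cat" relative to "dog", which is the intuition behind using analogous hallucinations as a contrast signal.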

Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination
