Abstract: Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs oblivious to the accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may equally advocate both accurate and erroneous content. To address this issue, we propose Pensieve, a training-free method that leverages the analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics, to mitigate hallucination. Specifically, Pensieve enables MLLMs to retrospect relevant images as references and compare their visual content with the test image via confidence score subtraction. Moreover, our paradigm balances the effects of addressing errors from both the visual and textual branches by adaptively scaling the subtracted scores. Experiments on Whoops, LLaVA Bench, POPE, and MME demonstrate the efficacy of Pensieve in mitigating visual hallucination, surpassing other advanced decoding strategies. Pensieve also aids MLLMs in identifying visual details and enhances the specificity of generated image descriptions.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper addresses visual hallucination in multimodal large language models (MLLMs) on vision-language tasks, that is, cases where the generated response does not match the provided image. Hallucinated output may conflict with the image, fabricate details, or overlook critical visual elements.
### Background and Motivation
Despite their impressive performance on various vision-language tasks, MLLMs are prone to visual hallucination, which can surface as incorrect descriptions of image content such as object locations, activities, or attributes. The authors' analysis shows that MLLMs do not completely ignore the correct visual cues when they hallucinate; rather, the visual branch may support accurate and erroneous content to a similar degree. Motivated by this finding, they propose a new method, Pensieve, to mitigate such hallucinations.
### Method Overview
Pensieve is a training-free method that primarily consists of two key components:
1. **Reviewing Visual Concepts**: Construct a reference database of diverse visual concepts for the MLLM to review. Specifically, samples are selected from the COCO Caption dataset, which covers a wide range of everyday visual content. The reference images should induce hallucinations analogous to those triggered by the test image, so that these shared hallucinations can be identified and suppressed.
2. **Contrasting Visual Concepts**: Help distinguish accurate visual cues by comparing confidence scores (logits) of the test image and reference images. Specific steps include:
- Retrieve k reference images similar to the test image from the reference database.
- For each candidate word, generate k+2 different predictions (one from the test image, one from a diffused, i.e. noised, version of the test image, and k from the retrieved reference images).
- Highlight accurate candidate words by subtracting the reference confidence scores from those of the test image, which reduces the impact of hallucinations shared with the similar images.
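
The retrieve-then-contrast steps above can be illustrated with a small NumPy sketch. It is a simplified illustration, not the paper's implementation: the function names `retrieve_top_k` and `contrast_logits`, the cosine-similarity retrieval, and the fixed weights `alpha` and `beta` are all assumptions made for this example. In particular, the paper scales the subtracted scores adaptively, which this sketch does not reproduce.

```python
import numpy as np


def retrieve_top_k(test_feat, ref_feats, k):
    """Indices of the k reference features most cosine-similar to test_feat.

    Cosine similarity over image features is an assumption for this sketch.
    """
    test = test_feat / np.linalg.norm(test_feat)
    refs = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sims = refs @ test
    return np.argsort(-sims)[:k]


def contrast_logits(test_logits, diffused_logits, ref_logits, alpha=0.5, beta=0.5):
    """Hypothetical confidence-score subtraction over k+2 predictions.

    test_logits:     (vocab,)   scores conditioned on the test image
    diffused_logits: (vocab,)   scores conditioned on the noised test image
    ref_logits:      (k, vocab) scores conditioned on each retrieved reference
    alpha, beta:     fixed illustrative weights (the paper adapts its scaling)
    """
    ref_mean = ref_logits.mean(axis=0)
    return (1 + alpha + beta) * test_logits - alpha * diffused_logits - beta * ref_mean


# Toy example with three candidate words: ["cat", "dog", "frisbee"].
test_logits = np.array([2.0, 2.1, 0.5])      # "dog" narrowly wins before contrast
diffused_logits = np.array([1.0, 1.0, 0.5])  # scores that survive without clear visual input
ref_logits = np.array([[0.5, 2.0, 0.5],      # both references also favour "dog",
                       [0.5, 2.0, 0.5]])     # i.e. an analogous hallucination
contrasted = contrast_logits(test_logits, diffused_logits, ref_logits)
best_before = int(np.argmax(test_logits))
best_after = int(np.argmax(contrasted))
```

In this toy case the references assign high confidence to "dog" even though they are different images, so subtracting their scores penalizes that candidate and the test-specific "cat" comes out on top. The numbers are made up purely to show the mechanics.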
### Experimental Results
The authors validated the effectiveness of Pensieve on multiple benchmark datasets, including Whoops, LLaVA Bench, MME, and POPE. Experimental results show that Pensieve significantly outperforms other advanced decoding strategies, such as VCD and DoLa, in reducing visual hallucinations. Specifically, on the Whoops dataset, Pensieve increased the FaithScore of LLaVA1.5 by 0.4 and improved the overall score of InstructBLIP by 55. Additionally, Pensieve helps MLLMs recognize details in images, generating more specific image descriptions.
### Main Contributions
1. **Empirical Study**: Reveals that MLLMs do not completely ignore correct visual cues when generating hallucinatory content but support both correct and incorrect content to some extent.
2. **New Method**: Proposes Pensieve, a training-free method that mitigates visual hallucinations by reviewing similar images and contrasting confidence scores.
3. **Experimental Validation**: Demonstrates the effectiveness of Pensieve on multiple benchmark datasets, particularly in reducing visual hallucinations and improving the specificity of image descriptions.