Abstract: Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. In order to mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs, while being interpretable by accessing intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.

What problem does this paper attempt to address?

The paper addresses the hallucination problem in Multimodal Large Language Models (MLLMs). Specifically:

1. **Problem Description**: Text generated by MLLMs is often inconsistent with the image content, a phenomenon known as "hallucination." Hallucinations are divided into object-level and attribute-level. The former refers to objects mentioned in the description that do not exist in the image; the latter refers to described attributes of objects that do not match the image.
2. **Limitations of Existing Methods**: Current approaches mostly rely on instruction tuning, which requires retraining the model on specific data. Although these methods are effective, they usually require a large amount of data and computational resources.
3. **Woodpecker Framework**: To address these limitations, the paper proposes Woodpecker, a training-free framework that corrects hallucinations in text already generated by an MLLM. It works in five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Because the intermediate output of each stage can be inspected, the process is transparent and interpretable: it diagnoses and corrects hallucinated parts while providing corresponding evidence (such as bounding boxes).
4. **Experimental Results**: The paper conducts quantitative and qualitative evaluations on POPE, MME, and LLaVA-QA90. On POPE, the abstract reports accuracy improvements of 30.66% and 24.33% over the MiniGPT-4 and mPLUG-Owl baselines, respectively. The results show that Woodpecker improves the accuracy of baseline models, especially in correcting object-level hallucinations.

In summary, the paper aims to mitigate the hallucination problem in MLLMs with a framework that requires no retraining, thereby improving the reliability and accuracy of model outputs.
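The five stages can be illustrated with a toy, text-only sketch. This is not the authors' implementation: every function name, the `VOCAB` and `COLORS` sets, and the `Scene` dictionary (which stands in for whatever vision models answer the questions) are assumptions made for illustration. It only shows how the stages hand data to one another and how object-level and attribute-level hallucinations could be handled differently.

```python
import re

# Hypothetical stand-in for the image: object name -> color of each instance.
# A real system would query vision models here instead of a dictionary.
Scene = dict[str, list[str]]

VOCAB = {"dog", "cat", "frisbee", "ball"}                       # toy concept list
COLORS = {"red", "blue", "green", "yellow", "black", "white"}  # toy attributes
COLOR_PATTERN = "|".join(sorted(COLORS))


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def extract_key_concepts(text: str) -> list[str]:
    """Stage 1: find the objects the description mentions."""
    return sorted(set(tokenize(text)) & VOCAB)


def formulate_questions(text: str, concepts: list[str]) -> list[tuple[str, str]]:
    """Stage 2: one existence question per concept, plus a color question
    when the text asserts a color directly before the noun."""
    words = tokenize(text)
    questions = []
    for concept in concepts:
        questions.append(("exists", concept))
        if any(p in COLORS and c == concept for p, c in zip(words, words[1:])):
            questions.append(("color", concept))
    return questions


def validate(scene: Scene, questions: list[tuple[str, str]]) -> dict:
    """Stage 3: answer each question against the (stubbed) visual evidence."""
    answers = {}
    for kind, concept in questions:
        instances = scene.get(concept, [])
        if kind == "exists":
            answers[(kind, concept)] = len(instances)
        elif instances:
            answers[(kind, concept)] = instances[0]
    return answers


def generate_claims(answers: dict) -> list[str]:
    """Stage 4: turn the answers into readable claims (the evidence trail)."""
    claims = []
    for (kind, concept), value in answers.items():
        if kind == "exists":
            claims.append(f"Detected {value} instance(s) of {concept}.")
        else:
            claims.append(f"The {concept} is {value}.")
    return claims


def correct(text: str, answers: dict) -> str:
    """Stage 5: drop sentences about absent objects, fix wrong colors."""
    kept = []
    for sentence in re.split(r"(?<=\.)\s+", text.strip()):
        # Object-level hallucination: the sentence mentions an absent object.
        if any(answers.get(("exists", w)) == 0 for w in tokenize(sentence)):
            continue
        # Attribute-level hallucination: overwrite the asserted color.
        for (kind, concept), value in answers.items():
            if kind == "color":
                pattern = rf"\b(?:{COLOR_PATTERN})\s+{concept}\b"
                sentence = re.sub(pattern, f"{value} {concept}", sentence)
        kept.append(sentence)
    return " ".join(kept)


def correct_description(text: str, scene: Scene) -> tuple[str, list[str]]:
    """Run all five stages; return the corrected text and the claims."""
    concepts = extract_key_concepts(text)
    questions = formulate_questions(text, concepts)
    answers = validate(scene, questions)
    return correct(text, answers), generate_claims(answers)


scene = {"dog": ["black"], "frisbee": ["red"]}
fixed, claims = correct_description(
    "A dog chases a blue frisbee. A cat watches nearby.", scene
)
print(fixed)  # A dog chases a red frisbee.
```

Returning the claims alongside the corrected text mirrors the interpretability point above: each correction can be traced back to an intermediate output rather than accepted on faith.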

Woodpecker: Hallucination Correction for Multimodal Large Language Models

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Piculet: Specialized Models-Guided Hallucination Decrease for MultiModal Large Language Models

Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs

NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

PFME: A Modular Approach for Fine-grained Hallucination Detection and Editing of Large Language Models

Mitigating Multilingual Hallucination in Large Vision-Language Models

Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models

A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs

MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization

Hallucination of Multimodal Large Language Models: A Survey

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Visual Hallucinations of Multi-modal Large Language Models

Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination

Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites