Woodpecker: Hallucination Correction for Multimodal Large Language Models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen
2023-10-25
Abstract: Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. In order to mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs, while being interpretable by accessing intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the hallucination problem in Multimodal Large Language Models (MLLMs). Specifically:

1. **Problem Description**: Text generated by MLLMs is often inconsistent with the image content, a phenomenon known as "hallucination." Hallucinations are divided into object-level and attribute-level hallucinations. The former refers to objects mentioned in the description that do not actually exist in the image, while the latter refers to described attributes of an object that do not match what the image shows.
2. **Limitations of Existing Methods**: Current approaches mostly rely on instruction tuning, which requires retraining the model on specific data. Although these methods are effective, they usually require a large amount of data and computational resources.
3. **Woodpecker Framework**: To address these limitations, the paper proposes Woodpecker, a training-free framework. It corrects hallucinations in MLLM-generated text through five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Because the intermediate outputs of each stage are accessible, the process is transparent and interpretable: it diagnoses and corrects the hallucinated parts while providing corresponding evidence (such as bounding boxes).
4. **Experimental Results**: The paper conducts quantitative and qualitative evaluations on POPE, MME, and LLaVA-QA90. On POPE, Woodpecker improves accuracy by 30.66% and 24.33% over the baseline MiniGPT-4 and mPLUG-Owl, respectively, and the gains are especially visible for object-level hallucinations.

In summary, the paper aims to mitigate hallucinations in MLLMs with a framework that requires no retraining, thereby improving the reliability and accuracy of model outputs.
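To make the five-stage flow concrete, here is a minimal Python sketch of a post-hoc correction pipeline with the same stage names. It is not the authors' implementation: every function name, the closed `VOCAB` list, the `Trace` container, and the toy `detections` dictionary (object name mapped to hypothetical bounding boxes) are assumptions made for illustration. A real system would presumably call language and vision models at each stage, whereas this sketch uses keyword matching, handles only object-level hallucinations, and corrects by dropping whole sentences.

```python
from dataclasses import dataclass, field

VOCAB = ("dog", "cat", "frisbee")  # hypothetical closed vocabulary of objects


@dataclass
class Trace:
    """Intermediate outputs of every stage, kept so the result can be inspected."""
    concepts: list = field(default_factory=list)
    questions: list = field(default_factory=list)
    counts: dict = field(default_factory=dict)
    claims: list = field(default_factory=list)


def tokens(sentence):
    # Lowercase the text and split it into words, ignoring commas and periods.
    return sentence.lower().replace(",", " ").replace(".", " ").split()


def extract_key_concepts(text):
    # Stage 1: objects from VOCAB that the generated description mentions.
    words = tokens(text)
    return [obj for obj in VOCAB if obj in words]


def formulate_questions(concepts):
    # Stage 2: one existence/count question per concept.
    return [(c, f"Is there any {c} in the image? How many?") for c in concepts]


def validate_visual_knowledge(questions, detections):
    # Stage 3: answer each question from the toy detector output (box count).
    return {c: len(detections.get(c, [])) for c, _ in questions}


def generate_visual_claims(counts):
    # Stage 4: turn the answers into declarative claims about the image.
    return [
        f"There is no {c} in the image." if n == 0
        else f"There are {n} {c}(s) in the image."
        for c, n in counts.items()
    ]


def correct_hallucinations(text, counts):
    # Stage 5 (toy rule): drop sentences that mention an object that was not found.
    missing = {c for c, n in counts.items() if n == 0}
    kept = [
        s.strip() for s in text.split(".")
        if s.strip() and not missing & set(tokens(s))
    ]
    return " ".join(s + "." for s in kept)


def woodpecker_sketch(text, detections):
    # Run the five stages in order and return the corrected text plus the trace.
    trace = Trace()
    trace.concepts = extract_key_concepts(text)
    trace.questions = formulate_questions(trace.concepts)
    trace.counts = validate_visual_knowledge(trace.questions, detections)
    trace.claims = generate_visual_claims(trace.counts)
    return correct_hallucinations(text, trace.counts), trace


# Hypothetical detector output: object name -> list of bounding boxes.
detections = {"dog": [(10, 20, 110, 140)], "frisbee": [(90, 15, 130, 45)]}
corrected, trace = woodpecker_sketch(
    "A dog is catching a frisbee. A cat is watching nearby.", detections
)
print(corrected)
print(trace.claims)
```

Keeping each stage's output in `trace` mirrors the interpretability point made above: the claims and the bounding boxes in `detections` show why a sentence was kept or removed. Because the sketch only post-processes text, it could in principle sit behind any MLLM without retraining.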