Abstract: Recently, Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension. However, they still suffer from hallucination, i.e., generating outputs that are inconsistent with the image content. To mitigate hallucinations, previous studies mainly focus on retraining LVLMs with custom datasets. Although effective, these approaches inherently incur additional computational costs. In this paper, we propose a training-free framework, \textbf{MVP}, that aims to reduce hallucinations by making the most of the innate capabilities of LVLMs via \textbf{M}ulti-\textbf{V}iew Multi-\textbf{P}ath Reasoning. Specifically, we first devise a multi-view information-seeking strategy to thoroughly perceive the information in the image, which enriches the general global information captured by the original vision encoder in LVLMs. Furthermore, during answer decoding, we observe that the occurrence of hallucinations correlates strongly with the certainty of the answer tokens. Thus, we propose multi-path reasoning for each information view, which quantifies and aggregates the certainty scores of each potential answer across multiple decoding paths and then decides the output answer. By fully grasping the information in the image and carefully considering the certainty of the potential answers when decoding, MVP can effectively reduce hallucinations in LVLMs. Extensive experiments verify that MVP significantly mitigates the hallucination problem across four well-known LVLMs. The source code is available at: \url{https://github.com/GasolSun36/MVP}.
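
The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `token_certainty` and `aggregate_answers` are hypothetical, the certainty measure (gap between the two most probable tokens) is an assumption, and summing scores across views and paths is only one plausible aggregation rule. The sketch assumes each decoding path has already been reduced to a candidate answer and a certainty score.

```python
from collections import defaultdict


def token_certainty(probs):
    """Assumed certainty measure: gap between the two largest probabilities
    of one answer-token distribution (hypothetical choice for illustration)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]


def aggregate_answers(paths_per_view):
    """Sum certainty scores per candidate answer over all views and paths.

    paths_per_view: list of views; each view is a list of
    (answer, certainty) pairs, one pair per decoding path.
    Returns the highest-scoring answer and the full score table.
    """
    scores = defaultdict(float)
    for paths in paths_per_view:
        for answer, certainty in paths:
            scores[answer] += certainty
    if not scores:
        raise ValueError("no decoding paths were provided")
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Three information views, each with one or two decoding paths (made-up numbers).
views = [
    [("yes", 0.6), ("no", 0.2)],
    [("yes", 0.1), ("no", 0.3)],
    [("yes", 0.5)],
]
best, scores = aggregate_answers(views)
print(best, scores)
```

In this toy input, "no" wins within the second view, but "yes" accumulates the larger total once all views and paths are combined, which is the behavior the multi-view multi-path aggregation is meant to capture.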

VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs

VisDiaHalBench: A Visual Dialogue Benchmark For Diagnosing Hallucination in Large Vision-Language Models

Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites

Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning

Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization

AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

Prompting Large Language Models with Fine-Grained Visual Relations from Scene Graph for Visual Question Answering

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning