Abstract: Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these, we propose Decompose and Compare Consistency (DeCC) for reliability measurement. By comparing the consistency between the direct answer generated using the VLM's internal reasoning process, and the indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM, DeCC measures the reliability of the VLM's direct answer. Experiments across six vision-language tasks with three VLMs show DeCC's reliability estimation achieves better correlation with task accuracy compared to the existing methods.

What problem does this paper attempt to address?

The paper addresses the fact that current state-of-the-art vision-language models (VLMs) still hallucinate and generate biased answers, so a way to judge how reliable a given response is becomes necessary. Existing reliability assessment methods, such as estimating uncertainty from answer likelihoods or prompting the model to state its own confidence, are often overconfident, while self-consistency comparison can be affected by confirmation bias. To address this, the authors propose the "Decompose and Compare Consistency (DeCC)" method. DeCC assesses the reliability of a VLM's answer by decomposing the original question into multiple sub-questions and comparing the consistency between the direct answer and the indirect answers obtained through sub-question reasoning. Experiments on six vision-language tasks with three VLMs show that DeCC's reliability estimates correlate better with task accuracy than those of existing methods.

### Specific Problem Description

1. **Hallucination and Bias Problems**: Current VLMs are prone to generating wrong or biased answers.
2. **Limitations of Existing Methods**:
   - **Uncertainty-Based Methods**: Methods that estimate uncertainty from answer likelihoods, or that prompt the model to generate a confidence value, are often overconfident.
   - **Self-Consistency Comparison**: This method may be affected by confirmation bias.

### Solution

The DeCC method proposed by the authors includes two main steps:

1. **Task Decomposition**: Decompose the original question into multiple sub-questions, and let the candidate VLM answer these sub-questions to generate a series of sub-question-answer pairs (sub-QA pairs).
2. **Consistency Comparison**: Use the candidate VLM and an independent language model (LLM) to reason separately over these sub-QA pairs, then compare the consistency between each of their reasoning results and the direct answer, thereby assessing the reliability of the direct answer.

In this way, DeCC aims to assess the reliability of VLM-generated answers more accurately, so that answers affected by hallucination or bias can be identified.
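
The two steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`decc_reliability`, `normalize`), the use of plain callables as stand-ins for the VLM and the LLM, the exact-match comparison, and the scoring rule (fraction of indirect answers that agree with the direct answer) are all assumptions made for illustration. The summary does not say how DeCC actually maps consistency to a reliability score, and the image input is assumed to be captured inside the model callables.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (sub-question, sub-answer)


def normalize(answer: str) -> str:
    """Lowercase, trim, and drop a trailing period so trivial differences do not count."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def decc_reliability(
    question: str,
    direct_answer: str,
    decompose: Callable[[str], List[str]],
    answer_sub_question: Callable[[str], str],
    vlm_reason: Callable[[str, List[QAPair]], str],
    llm_reason: Callable[[str, List[QAPair]], str],
) -> float:
    """Return the fraction of indirect answers consistent with the direct answer.

    All four callables are hypothetical stand-ins for model calls.
    """
    # Step 1: task decomposition. The candidate VLM answers each sub-question.
    sub_questions = decompose(question)
    sub_qa = [(q, answer_sub_question(q)) for q in sub_questions]

    # Step 2: two reasoners derive indirect answers from the sub-QA pairs.
    indirect_answers = [
        vlm_reason(question, sub_qa),  # candidate VLM
        llm_reason(question, sub_qa),  # independent LLM
    ]

    # Assumed scoring rule: share of indirect answers matching the direct one.
    target = normalize(direct_answer)
    agreements = sum(normalize(a) == target for a in indirect_answers)
    return agreements / len(indirect_answers)


# Toy stubs standing in for real model calls (illustrative only).
demo_score = decc_reliability(
    question="Is the person in the image holding an umbrella?",
    direct_answer="Yes",
    decompose=lambda q: ["Is there a person?", "What is the person holding?"],
    answer_sub_question=lambda q: "yes" if "person?" in q else "a cane",
    vlm_reason=lambda q, pairs: "Yes.",
    llm_reason=lambda q, pairs: "No",
)
print(demo_score)  # 0.5: the two reasoners disagree about the direct answer
```

In this toy run the independent LLM, which sees only the sub-QA pairs, contradicts the direct answer while the VLM's own reasoning confirms it. This kind of disagreement is what the consistency comparison is meant to surface.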

Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison
