Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison

Qian Yang, Weixiang Yan, Aishwarya Agrawal
2024-10-09
Abstract: Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these, we propose Decompose and Compare Consistency (DeCC) for reliability measurement. By comparing the consistency between the direct answer generated using the VLM's internal reasoning process, and the indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM, DeCC measures the reliability of the VLM's direct answer. Experiments across six vision-language tasks with three VLMs show DeCC's reliability estimation achieves better correlation with task accuracy compared to the existing methods.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses the tendency of current state-of-the-art vision-language models (VLMs) to hallucinate and to generate biased answers, and the resulting need to judge how reliable a given answer is. Existing reliability assessment methods, such as estimating uncertainty from answer likelihoods or prompting the model to state its confidence, are often overconfident, while self-consistency comparison can be affected by confirmation bias. A more reliable way to assess the answers generated by VLMs is therefore needed.

To address this, the authors propose Decompose and Compare Consistency (DeCC). DeCC assesses the reliability of a VLM's answer by decomposing the original question into multiple sub-questions and comparing the direct answer with the indirect answers obtained by reasoning over the sub-questions. Experimental results show that DeCC's reliability estimates on six vision-language tasks correlate better with task accuracy than those of existing methods.

### Specific Problem Description

1. **Hallucination and Bias Problems**: Current VLMs are prone to generating wrong or biased answers.
2. **Limitations of Existing Methods**:
   - **Uncertainty-Based Methods**: Methods that estimate uncertainty from answer likelihoods, or that prompt the model to generate a confidence value, are often overconfident.
   - **Self-Consistency Comparison**: This method may be affected by confirmation bias.

### Solution

The DeCC method proposed by the authors includes two main steps:

1. **Task Decomposition**: Decompose the original question into multiple sub-questions, and let the candidate VLM answer them to produce a series of sub-question-answer pairs (sub-QA pairs).
2. **Consistency Comparison**: Have the candidate VLM and an independent language model (LLM) each reason over these sub-QA pairs, then compare each of their reasoning results with the direct answer to assess the reliability of the direct answer.

In this way, DeCC can assess the reliability of the answers generated by VLMs more accurately and reduce the impact of hallucination and bias.
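
The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `decc_check`, `DeCCResult`, and `normalize`, the callable signatures, and the string-matching rule are all hypothetical. The model calls are replaced by toy stubs, and the image input that a real VLM would receive is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

SubQA = Tuple[str, str]  # (sub-question, sub-answer)


@dataclass
class DeCCResult:
    direct_answer: str
    vlm_indirect: str     # answer the candidate VLM infers from the sub-QA pairs
    llm_indirect: str     # answer the independent LLM infers from the sub-QA pairs
    vlm_consistent: bool  # does vlm_indirect match the direct answer?
    llm_consistent: bool  # does llm_indirect match the direct answer?


def normalize(answer: str) -> str:
    """Crude matching rule (an assumption): lowercase, trim, drop trailing period."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def decc_check(
    question: str,
    direct_answer: str,
    decompose: Callable[[str], List[str]],
    vlm_answer: Callable[[str], str],
    vlm_reason: Callable[[str, List[SubQA]], str],
    llm_reason: Callable[[str, List[SubQA]], str],
) -> DeCCResult:
    # Step 1: task decomposition; the candidate VLM answers each sub-question.
    sub_qa = [(sub_q, vlm_answer(sub_q)) for sub_q in decompose(question)]

    # Step 2: both reasoners derive an indirect answer from the sub-QA pairs,
    # and each indirect answer is compared with the direct answer.
    vlm_indirect = vlm_reason(question, sub_qa)
    llm_indirect = llm_reason(question, sub_qa)
    target = normalize(direct_answer)
    return DeCCResult(
        direct_answer=direct_answer,
        vlm_indirect=vlm_indirect,
        llm_indirect=llm_indirect,
        vlm_consistent=normalize(vlm_indirect) == target,
        llm_consistent=normalize(llm_indirect) == target,
    )


# Toy stubs standing in for real model calls (hypothetical behavior).
SUB_ANSWERS = {
    "Is there a person?": "yes",
    "Is the person holding an umbrella?": "yes",
    "Is it raining?": "no",
}


def toy_decompose(question: str) -> List[str]:
    return list(SUB_ANSWERS)


def toy_vlm_answer(sub_question: str) -> str:
    return SUB_ANSWERS[sub_question]


def toy_reason(question: str, sub_qa: List[SubQA]) -> str:
    # The main answer is "yes" only if every sub-answer is "yes".
    return "yes" if all(a == "yes" for _, a in sub_qa) else "no"


result = decc_check(
    question="Is the person holding an umbrella in the rain?",
    direct_answer="Yes.",
    decompose=toy_decompose,
    vlm_answer=toy_vlm_answer,
    vlm_reason=toy_reason,
    llm_reason=toy_reason,
)
print(result)
```

In this toy run the direct answer disagrees with both indirect answers, which would mark it as less reliable. How the two consistency outcomes are combined into a final reliability estimate is not specified in the summary above, so the sketch returns them as separate flags.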