Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning

Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, Tat-Seng Chua
2024-12-15
Abstract: Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing various vision-language tasks. However, MLLMs face significant challenges with hallucinations, i.e., misleading outputs that do not align with the input data. While efforts have been made to combat MLLM hallucinations, several pivotal challenges remain unsolved. First, while current approaches focus heavily on addressing errors at the perception level, another important type at the cognition level, which requires factual commonsense, can be overlooked. In addition, existing methods might fall short in finding a more effective way to represent visual input, which remains a key bottleneck that triggers visual hallucinations. Moreover, MLLMs can frequently be misled by faulty textual inputs and produce hallucinations, yet this type of issue has long been overlooked by existing studies. Inspired by human intuition in handling hallucinations, this paper introduces a novel bottom-up reasoning framework. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs. Extensive experiments demonstrate significant improvements in multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our methods in addressing perception- and cognition-level hallucinations.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to solve the hallucination problem in Multimodal Large Language Models (MLLMs). Specifically, when dealing with vision-language tasks, MLLMs often produce misleading outputs that do not match the input data, namely hallucinations. These hallucinations seriously reduce the practical value of the model.

#### Main types and challenges of hallucination

1. **Perceptual-level hallucinations**:
   - **Object hallucination**: For example, the model may misidentify objects in an image.
   - **Attribute hallucination**: For example, the model may misdescribe the attributes of an object.
   - **Relational hallucination**: For example, the model may wrongly infer the relationships between objects.
2. **Cognitive-level hallucinations**:
   - **Against common sense**: The model may generate content that goes against common sense.
   - **Object conflict**: For example, the objects mentioned in the text are inconsistent with the actual objects in the image.
   - **Attribute conflict**: For example, the attributes described in the text are inconsistent with the actual attributes in the image.
   - **Relational conflict**: For example, the relationships described in the text are inconsistent with the actual relationships in the image.
   - **Over-generalization**: For example, the model may make overly broad assumptions based on insufficient information.

#### Deficiencies of existing methods

1. **Existing methods mainly focus on the perceptual level**, ignoring problems at the cognitive level, especially in cases where factual common-sense reasoning is required.
2. **Insufficient visual input representation**: Existing visual representation methods do not effectively capture the semantic structure of visual content, which leads to visual hallucinations.
3. **Hallucinations in text input are ignored**: The text input provided by the user may be inconsistent with the visual content, which causes the model to produce hallucinations.

#### Solutions

To solve the above problems, the authors propose a bottom-up holistic reasoning framework that systematically addresses potential problems in both visual and text inputs. By verifying and integrating perceptual-level information and cognitive-level common-sense knowledge, it aims to produce more reliable outputs. Specifically:

1. **Object recognition and visual perception**: Guide the model to focus on the visual areas most relevant to the user's question and generate a partial scene graph to capture complete visual information.
2. **Visual perception verification**: Use external tools to verify the objects, attributes, and relationships in the partial scene graph to ensure the accuracy of the perceived content.
3. **Question verification and adjustment**: Check whether the input question conflicts with the high-fidelity visual perception and make necessary corrections.
4. **Common-sense reasoning**: When knowledge at the cognitive level is required, generate the necessary common-sense statements.
5. **Common-sense verification**: Verify the generated common-sense statements against an external knowledge base.
6. **Question answering**: Synthesize all verified perceptual information and common-sense knowledge to generate the final answer.

With this framework, the authors report significant improvements on multiple benchmarks, especially in reducing hallucinations at the perceptual and cognitive levels.
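The six steps above can be read as a pipeline in which each stage only passes verified information to the next. The following is a minimal sketch of that control flow, not the authors' implementation. Every name in it (`SceneGraph`, `Tools`, `verify_graph`, `adjust_question`, `run_pipeline`, `toy_tools`) is a hypothetical placeholder, and the MLLM, the external detector, and the knowledge base are replaced by toy lambdas so that the example runs on its own. As a simplification, step 2 here only checks objects and drops attributes and relations attached to rejected objects, whereas the summary describes verifying attributes and relationships as well.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class SceneGraph:
    """Partial scene graph: objects, (object, attribute) pairs, relation triples."""
    objects: Set[str] = field(default_factory=set)
    attributes: Set[Tuple[str, str]] = field(default_factory=set)
    relations: Set[Triple] = field(default_factory=set)


@dataclass
class Tools:
    """Hypothetical stand-ins for the MLLM and the external verifiers."""
    perceive: Callable[[str, str], SceneGraph]             # step 1
    detect_object: Callable[[str, str], bool]              # step 2
    mentioned_objects: Callable[[str], Set[str]]           # step 3
    propose_facts: Callable[[str, SceneGraph], List[str]]  # step 4
    fact_in_kb: Callable[[str], bool]                      # step 5
    answer: Callable[[str, SceneGraph, List[str]], str]    # step 6


def verify_graph(image: str, graph: SceneGraph, tools: Tools) -> SceneGraph:
    """Step 2 (simplified): keep only objects the external detector confirms."""
    kept = {o for o in graph.objects if tools.detect_object(image, o)}
    return SceneGraph(
        objects=kept,
        attributes={(o, a) for o, a in graph.attributes if o in kept},
        relations={(s, r, o) for s, r, o in graph.relations
                   if s in kept and o in kept},
    )


def adjust_question(question: str, graph: SceneGraph, tools: Tools) -> str:
    """Step 3: flag objects the question assumes but the verified graph lacks."""
    missing = sorted(tools.mentioned_objects(question) - graph.objects)
    if not missing:
        return question
    return f"{question} [note: not found in image: {', '.join(missing)}]"


def run_pipeline(image: str, question: str, tools: Tools) -> str:
    graph = tools.perceive(image, question)              # 1. perception
    graph = verify_graph(image, graph, tools)            # 2. perception check
    question = adjust_question(question, graph, tools)   # 3. question check
    facts = tools.propose_facts(question, graph)         # 4. common sense
    facts = [f for f in facts if tools.fact_in_kb(f)]    # 5. common-sense check
    return tools.answer(question, graph, facts)          # 6. final answer


def toy_tools() -> Tools:
    """Toy stubs (assumptions for illustration only, no real model is called)."""
    truth = {"dog", "frisbee"}          # what the "detector" can confirm
    kb = {"dogs can catch frisbees"}    # the "knowledge base"
    vocab = {"dog", "cat", "frisbee"}
    return Tools(
        perceive=lambda image, q: SceneGraph(
            objects={"dog", "frisbee", "cat"},  # "cat" is hallucinated
            attributes={("dog", "brown"), ("cat", "black")},
            relations={("dog", "chasing", "frisbee"), ("cat", "next to", "dog")},
        ),
        detect_object=lambda image, obj: obj in truth,
        mentioned_objects=lambda q: set(q.lower().replace("?", "").split()) & vocab,
        propose_facts=lambda q, g: ["dogs can catch frisbees", "frisbees are edible"],
        fact_in_kb=lambda fact: fact in kb,
        answer=lambda q, g, facts: (
            f"objects={sorted(g.objects)}; facts={facts}; question={q}"
        ),
    )


if __name__ == "__main__":
    print(run_pipeline("img.jpg", "What is the cat chasing?", toy_tools()))
```

In this toy run the hallucinated "cat" is removed in step 2, so the question about the cat is annotated in step 3 before any answer is produced, and the unsupported statement "frisbees are edible" is filtered out in step 5. Passing the components in as callables is only a design choice for the sketch: it keeps the ordering of the stages visible while leaving open which model, detector, or knowledge base fills each slot.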