Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
Abstract: Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.
### Problems the paper attempts to solve
This paper aims to systematically study the hallucination problem of large multimodal models (LMMs) when processing language, visual, and audio inputs. Hallucination refers to the inconsistency between the generated text output and the multimodal input, which severely limits the application of LMMs in the real world, especially in tasks requiring precise and factual content generation.
### Main contributions
1. **Systematic research**:
- The paper presents the first systematic study of hallucination in LMMs across three common modalities: language, vision, and audio.
- It identifies two main contributors to hallucination: overreliance on unimodal priors and spurious cross-modal correlations.
2. **Benchmark test**:
- The paper introduces a comprehensive benchmark, "The Curse of Multi-Modalities (CMM)", to evaluate hallucination in LMMs.
- CMM casts hallucination evaluation as a binary classification task using object-level and event-level probes, covering visual, audio, and combined visual-audio contexts.
3. **Evaluation and analysis**:
- A range of state-of-the-art LMMs is evaluated in visual, audio, and combined contexts, revealing the models' limitations and fundamental challenges in multimodal learning.
- Two diagnostic metrics, perception accuracy (PA) and hallucination resistance (HR), are proposed to measure an LMM's perception ability and the severity of its hallucinations.
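
As an illustration of how two such metrics could be computed from binary probe answers, the sketch below assumes that PA is the share of correct "yes" answers on probes whose target is actually present, and that HR is the share of correct "no" answers on probes whose target is absent. These definitions, along with the names `ProbeResult`, `perception_accuracy`, and `hallucination_resistance`, are assumptions made for this example and are not taken from the summary above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """One binary probe (hypothetical structure, for illustration only)."""
    target_present: bool  # ground truth: is the queried object/event in the input?
    answered_yes: bool    # the model's yes/no answer to the probe


def perception_accuracy(results: List[ProbeResult]) -> float:
    """Assumed PA: fraction of present-target probes answered 'yes'."""
    present = [r for r in results if r.target_present]
    if not present:
        return float("nan")
    return sum(r.answered_yes for r in present) / len(present)


def hallucination_resistance(results: List[ProbeResult]) -> float:
    """Assumed HR: fraction of absent-target probes answered 'no'."""
    absent = [r for r in results if not r.target_present]
    if not absent:
        return float("nan")
    return sum(not r.answered_yes for r in absent) / len(absent)


# Toy data: 4 probes with the target present, 4 with the target absent.
results = (
    [ProbeResult(True, True)] * 3 + [ProbeResult(True, False)]
    + [ProbeResult(False, False)] + [ProbeResult(False, True)] * 3
)
pa = perception_accuracy(results)
hr = hallucination_resistance(results)
```

Under these assumed definitions, a model that answers "yes" to almost everything scores a high PA and a low HR, which is why the two numbers are more informative together than a single accuracy figure.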
### Main causes of hallucination
1. **Excessive reliance on unimodal priors**:
- **Language-dominated**: The model relies too heavily on its pre-trained large language model (LLM), so its responses follow linguistic patterns or prior knowledge from large text corpora even when the visual or audio input contradicts them.
- **Vision-dominated**: The model relies too heavily on visual information and ignores language and audio cues, so its output is overly shaped by the visual context.
- **Audio-dominated**: The model relies too heavily on audio input and ignores visual or language information, so its output is overly shaped by the audio context.
2. **Spurious cross-modal correlations**:
- **Vision-language**: The model hallucinates visual objects or events based on patterns learned during pre-training. For example, because "mobile phone" frequently co-occurs with "person", the model may report a mobile phone whenever it recognizes a person, even if the scene contains none.
- **Audio-language**: The model associates non-existent sound events with text descriptions because those descriptions are over-represented in the pre-training data. For example, because "dog barking" is so common, the model may report barking when the dog is only whimpering.
- **Vision-audio-language**: During joint video-audio training, the model learns spurious correlations between visual objects and audio events that frequently co-occur. For example, the audio event "bird singing" is often paired with the visual object "tree", so the model may hallucinate a tree when it hears birdsong, or birdsong when it sees a tree.
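
To make the idea of a spurious co-occurrence concrete, the following sketch counts how often labels appear together in a set of annotations and, for a given label, picks its most frequent partner. That partner is a natural candidate to ask about on an input where it is actually absent. The toy annotations and the helper names `cooccurrence_counts` and `most_frequent_partner` are hypothetical; this illustrates the general mechanism only and is not the paper's probe construction procedure.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Optional, Tuple

Pair = Tuple[str, str]


def cooccurrence_counts(annotations: Iterable[List[str]]) -> Dict[Pair, int]:
    """Count how many samples contain each unordered pair of labels."""
    pair_counts: Counter = Counter()
    for labels in annotations:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for a, b in combinations(sorted(set(labels)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def most_frequent_partner(label: str, pair_counts: Dict[Pair, int]) -> Optional[str]:
    """Return the label that co-occurs most often with `label`, or None."""
    partners: Counter = Counter()
    for (a, b), n in pair_counts.items():
        if a == label:
            partners[b] += n
        elif b == label:
            partners[a] += n
    return partners.most_common(1)[0][0] if partners else None


# Toy annotations (made up for this example).
annotations = [
    ["person", "mobile phone"],
    ["person", "mobile phone", "chair"],
    ["person", "dog"],
    ["bird singing", "tree"],
]
counts = cooccurrence_counts(annotations)
probe_target = most_frequent_partner("person", counts)
```

With this toy data, `probe_target` is `"mobile phone"`, so a probe such as "Is there a mobile phone in the video?" on a clip that shows a person without a phone would test whether the model answers from the input or from the learned co-occurrence.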
### Conclusion
Through a systematic study and the CMM benchmark, the paper characterizes how LMMs hallucinate when integrating language, visual, and audio inputs, and it proposes diagnostic metrics and directions for improvement. These findings offer a reference point for improving the reliability of LMMs and reducing hallucination.