AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, Tae-Hyun Oh
2024-10-24
Abstract: Following the success of Large Language Models (LLMs), expanding their boundaries to new modalities represents a significant paradigm shift in multimodal understanding. Human perception is inherently multimodal, relying not only on text but also on auditory and visual cues for a complete understanding of the world. In recognition of this fact, audio-visual LLMs have recently emerged. Despite promising developments, the lack of dedicated benchmarks poses challenges for understanding and evaluating models. In this work, we show that audio-visual LLMs struggle to discern subtle relationships between audio and visual signals, leading to hallucinations, underscoring the need for reliable benchmarks. To address this, we introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs. Our benchmark includes tests for assessing hallucinations, as well as the cross-modal matching and reasoning abilities of these models. Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities, due to their limited capacity to perceive complex multimodal signals and their relationships. Additionally, we demonstrate that simple training with our AVHBench improves robustness of audio-visual LLMs against hallucinations.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the tendency of audio-visual large language models (audio-visual LLMs) to hallucinate when processing multimodal signals. Specifically, the authors point out that current audio-visual LLMs struggle to understand the complex relationships between audio and video signals, which can lead them to generate inaccurate or non-existent information. For example, a model may imagine sounds that are not present based on visual cues, or imagine visual events that are not present based on sounds.

To address this, the authors propose a benchmark named **AVHBench**, designed specifically to evaluate how susceptible audio-visual LLMs are to cross-modal hallucination. AVHBench contains four tasks, each evaluating a different ability of the model:

1. **Audio-driven Video Hallucination**: Evaluates whether audio signals cause the model to hallucinate visual objects or events.
2. **Video-driven Audio Hallucination**: Evaluates whether visual signals cause the model to hallucinate audio objects or events.
3. **Audio-visual Matching**: Evaluates the model's ability to recognize the correspondence between audio and video signals.
4. **Audio-visual Captioning**: Evaluates the model's ability to accurately describe audio and video signals.

Through these tasks, AVHBench evaluates how well audio-visual LLMs process multimodal signals and reveals their weaknesses in cross-modal hallucination. In addition, the authors propose a semi-automatic annotation pipeline for constructing the dataset, which reduces the cost of manual annotation while maintaining annotation quality.
Overall, the core aim of this paper is to improve the robustness and accuracy of audio-visual large language models in processing multimodal signals and to reduce the occurrence of cross-modal hallucinations.
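
As a rough illustration of how the two hallucination tasks might be scored, the sketch below assumes each probe is a yes/no question about whether an object or event is present, and that the model replies in free-form text. That question format, the task labels, and the names `Probe`, `parse_yes_no`, and `score` are assumptions made for this example; they are not taken from the paper or from any released AVHBench code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Probe:
    task: str      # hypothetical label, e.g. "video_driven_audio"
    question: str  # yes/no question about an object or event
    truth: bool    # True if the queried object/event is really present
    answer: str    # free-form reply from the model under test


def parse_yes_no(reply: str) -> Optional[bool]:
    """Map a free-form reply to True ("yes"), False ("no"), or None (unclear)."""
    words = reply.strip().lower().replace(",", " ").replace(".", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


def score(probes: List[Probe]) -> Dict[str, Dict[str, float]]:
    """Per-task accuracy and hallucination rate.

    Hallucination rate here means the fraction of probes about an absent
    object/event that the model answered with "yes".
    """
    stats: Dict[str, Dict[str, int]] = {}
    for p in probes:
        s = stats.setdefault(
            p.task, {"n": 0, "correct": 0, "absent": 0, "hallucinated": 0}
        )
        pred = parse_yes_no(p.answer)
        s["n"] += 1
        s["correct"] += int(pred == p.truth)  # unclear replies count as wrong
        if not p.truth:
            s["absent"] += 1
            s["hallucinated"] += int(pred is True)
    return {
        task: {
            "accuracy": s["correct"] / s["n"],
            "hallucination_rate": (
                s["hallucinated"] / s["absent"] if s["absent"] else 0.0
            ),
        }
        for task, s in stats.items()
    }
```

Reporting the hallucination rate separately from accuracy is a design choice for this sketch: a model that answers "yes" to every question can reach moderate accuracy on a balanced probe set while hallucinating on every absent item.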