Abstract: Following the success of Large Language Models (LLMs), expanding their boundaries to new modalities represents a significant paradigm shift in multimodal understanding. Human perception is inherently multimodal, relying not only on text but also on auditory and visual cues for a complete understanding of the world. In recognition of this fact, audio-visual LLMs have recently emerged. Despite promising developments, the lack of dedicated benchmarks poses challenges for understanding and evaluating models. In this work, we show that audio-visual LLMs struggle to discern subtle relationships between audio and visual signals, leading to hallucinations, underscoring the need for reliable benchmarks. To address this, we introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs. Our benchmark includes tests for assessing hallucinations, as well as the cross-modal matching and reasoning abilities of these models. Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities, due to their limited capacity to perceive complex multimodal signals and their relationships. Additionally, we demonstrate that simple training with our AVHBench improves robustness of audio-visual LLMs against hallucinations.

What problem does this paper attempt to address?

This paper addresses the tendency of audio-visual large language models (audio-visual LLMs) to hallucinate when processing multimodal signals. Specifically, the authors point out that current audio-visual LLMs struggle to understand the complex relationships between audio and visual signals, which can lead a model to generate inaccurate or nonexistent information. For example, a model may imagine a sound that is not present based on visual cues, or imagine a visual event that does not occur based on what it hears.

To study this problem, the authors propose a benchmark named **AVHBench**, designed to evaluate cross-modal hallucination in audio-visual LLMs. AVHBench contains four tasks, each evaluating a different ability of the model:

1. **Audio-driven Video Hallucination**: Evaluates whether audio signals cause the model to hallucinate visual objects or events.
2. **Video-driven Audio Hallucination**: Evaluates whether visual signals cause the model to hallucinate audio objects or events.
3. **Audio-visual Matching**: Evaluates the model's ability to recognize the correspondence between audio and visual signals.
4. **Audio-visual Captioning**: Evaluates the model's ability to accurately describe audio and visual signals.

Through these tasks, AVHBench evaluates how audio-visual LLMs handle multimodal signals and reveals their weaknesses in cross-modal hallucination. In addition, the authors propose a semi-automatic annotation pipeline for constructing the dataset, which reduces the cost of manual annotation while keeping annotation quality high.

Overall, the core aim of this paper is to improve the robustness and accuracy of audio-visual LLMs in processing multimodal signals and to reduce the occurrence of cross-modal hallucinations.
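As a rough illustration of how the three judgment-style tasks above might be scored, the following is a minimal sketch. It assumes each benchmark item is a yes/no question with a ground-truth label; that format, the task labels, and the names `Item`, `normalize`, and `accuracy_by_task` are hypothetical and are not taken from the summary above. The captioning task is left out because free-form descriptions need a different kind of metric.

```python
from dataclasses import dataclass


@dataclass
class Item:
    task: str       # hypothetical task label, e.g. "audio_visual_matching"
    expected: str   # assumed ground-truth answer: "yes" or "no"
    predicted: str  # raw text returned by the model


def normalize(answer: str) -> str:
    """Map a free-form reply to 'yes', 'no', or 'unknown' using its first word."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,!?:;") if words else ""
    return first if first in ("yes", "no") else "unknown"


def accuracy_by_task(items):
    """Return the fraction of correctly answered items for each task."""
    totals, correct = {}, {}
    for item in items:
        totals[item.task] = totals.get(item.task, 0) + 1
        if normalize(item.predicted) == item.expected:
            correct[item.task] = correct.get(item.task, 0) + 1
    return {task: correct.get(task, 0) / n for task, n in totals.items()}


# Made-up example items, for illustration only.
items = [
    Item("audio_driven_video_hallucination", "no", "Yes, a dog is visible."),
    Item("audio_driven_video_hallucination", "no", "No."),
    Item("video_driven_audio_hallucination", "yes", "yes"),
    Item("audio_visual_matching", "no", "I am not sure"),
]
scores = accuracy_by_task(items)
```

Reporting accuracy per task rather than as a single pooled number keeps the two hallucination directions (audio-driven and video-driven) separate, which is the distinction the task list draws. Replies that are neither "yes" nor "no" are counted as incorrect in this sketch; a real evaluation might handle them differently.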

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
