Abstract: Various audio-LLMs (ALLMs) have recently been explored for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark, which consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that existing ALLMs, while powerful in comprehending the primary audio elements of individual audio inputs, struggle to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) that captures audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data, without requiring human annotations. The proposed MALLM opens the door for ALLMs toward the multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is the deficiencies of existing audio large language models (ALLMs) in handling multi-audio-stream tasks. Specifically:

1. **Gap between single-audio and multi-audio processing**: Existing ALLMs mainly focus on tasks that process single-audio inputs, while real-world applications often require handling multiple audio streams simultaneously. For example, a virtual assistant may need to process voice commands from different users at the same time.
2. **Lack of systematic evaluation**: Although there are evaluation benchmarks for single-audio tasks, there is a lack of systematic evaluation and benchmark testing for multi-audio tasks. As a result, the performance of current ALLMs in multi-audio scenarios has not been clearly quantified.
3. **Simulation of human auditory ability**: To come closer to human auditory processing ability, machines need to be able to effectively understand and process the relationships between multiple audio streams.

To solve these problems, the paper makes the following contributions:

- **Multi-Audio Evaluation Benchmark (MAE)**: Constructed the first benchmark specifically for evaluating the multi-audio processing ability of ALLMs, covering 11 tasks, including open-ended and closed-ended generation tasks in the speech and sound fields.
- **Advanced Multi-Audio Large Language Model (MALLM)**: Proposed an innovative and scalable multi-audio large language model, which enhances the model's performance on multi-audio tasks through discriminative learning and synthetic data generation strategies.
- **Comprehensive evaluation**: Conducted extensive experimental evaluations on 15 existing ALLMs, providing a solid foundation for future research.

Through these contributions, the paper aims to promote the transformation of ALLMs from single-audio processing to multi-audio processing, thereby enhancing human-machine interaction capabilities and getting closer to replicating human auditory processing abilities.
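The summary above mentions discriminative learning on synthetic data without human annotations, but it does not describe the recipe. As a purely illustrative sketch (not the paper's method), the snippet below shows one way a pair of similar audios can be synthesized so that the correct answer is known by construction. The helper names `sine_clip` and `make_synthetic_pair`, the additive-noise perturbation, and the question wording are all assumptions made for this example.

```python
import math
import random


def sine_clip(freq_hz=440.0, sample_rate=8000, n_samples=800):
    """Hypothetical stand-in for a real audio clip: a plain sine wave."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


def make_synthetic_pair(base, rng, noise_scale=0.5):
    """Build two similar clips plus a question whose answer needs no annotator.

    Assumed recipe: one copy stays clean, the other gets additive Gaussian
    noise, and the order is chosen at random so the label is known for free.
    """
    noisy = [x + rng.gauss(0.0, noise_scale) for x in base]
    if rng.random() < 0.5:
        first, second, answer = list(base), noisy, "second"
    else:
        first, second, answer = noisy, list(base), "first"
    question = "Which audio contains more background noise, the first or the second?"
    return first, second, question, answer
```

Because the label follows directly from how the pair was constructed, any number of such training examples can be generated from unlabeled clips, which is the sense in which synthetic data can avoid human annotation.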

Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
