Abstract: This paper presents a comprehensive chatbot system designed to handle a wide range of audio-related queries by integrating multiple specialized audio processing models. The proposed system uses an intent classifier, trained on a diverse audio query dataset, to route queries about audio content to expert models such as Automatic Speech Recognition (ASR), Speaker Diarization, Music Identification, and Text-to-Audio generation. A 3.8B-parameter LLM then takes input from an Audio Context Detection (ACD) module, which extracts audio event information from the audio, and post-processes the text-domain outputs of the expert models to compute the final response to the user. We evaluated the system on custom audio tasks and on the sound set of the MMAU benchmark. The custom datasets were motivated by target use cases not covered in industry benchmarks and included the ACD-timestamp-QA (Question Answering) and ACD-temporal-QA datasets, which evaluate timestamp and temporal reasoning questions, respectively. First, we determined that a BERT-based intent classifier outperforms an LLM few-shot intent classifier in routing queries. Experiments further show that our approach significantly improves accuracy on some custom tasks compared to state-of-the-art Large Audio Language Models and outperforms models in the 7B parameter size range on the sound test set of the MMAU benchmark, thereby offering an attractive option for on-device deployment.
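
The routing step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the intent labels, the keyword lists, and the stub expert functions are hypothetical, and the keyword matcher is only a toy stand-in for the trained BERT-based intent classifier, not the authors' implementation.

```python
from typing import Callable, Dict, Tuple

# Hypothetical intent labels and keywords; the paper's label set is not given here.
INTENT_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "asr": ("transcribe", "what was said", "transcript"),
    "diarization": ("who spoke", "how many speakers", "speaker"),
    "music_id": ("song", "music", "track"),
    "text_to_audio": ("generate", "create a sound", "synthesize"),
}
FALLBACK_INTENT = "general_audio_qa"  # handled by the LLM with ACD context only


def classify_intent(query: str) -> str:
    """Toy stand-in for the trained intent classifier (simple keyword match)."""
    q = query.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in q for keyword in keywords):
            return intent
    return FALLBACK_INTENT


# Stub expert models: each takes an audio path and returns text for the LLM.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "asr": lambda path: f"[transcript of {path}]",
    "diarization": lambda path: f"[speaker segments of {path}]",
    "music_id": lambda path: f"[music-id result for {path}]",
    "text_to_audio": lambda path: "[generated audio reference]",
}


def route(query: str, audio_path: str) -> Tuple[str, str]:
    """Pick an expert from the predicted intent and return (intent, expert text)."""
    intent = classify_intent(query)
    expert = EXPERTS.get(intent)
    expert_output = expert(audio_path) if expert else ""
    return intent, expert_output


if __name__ == "__main__":
    print(route("What song is playing?", "clip.wav"))
    print(route("Is a dog barking at the end?", "clip.wav"))
```

In this sketch a query that matches no expert falls through to a general question-answering intent with empty expert output, leaving the LLM to answer from audio event context alone.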

What problem does this paper attempt to address?

The main problem this paper addresses is the limited ability of current chatbot systems to handle queries about audio content. Existing chatbots typically handle only specific audio tasks, such as speech recognition or music recommendation, and cannot cover multiple types of audio queries. They also perform poorly at understanding and answering complex audio-related questions, especially when several audio processing models need to be combined.

To address these challenges, the paper proposes a comprehensive chatbot system that processes a wide range of audio-related queries by integrating multiple specialized audio processing models, such as Automatic Speech Recognition (ASR), Speaker Diarization, Music Identification, and Text-to-Audio generation. The core components are an intent classifier, which routes user queries to the corresponding expert models, and a language model, which generates coherent, context-aware responses. The system also introduces an Audio Context Detection (ACD) module to improve its understanding and processing of audio events.

The key innovations of the system are:

1. **Intent Classifier**: A BERT-based intent classifier is trained to route audio-related text queries to the appropriate audio processing models. Experimental results show that it outperforms the LLM few-shot intent classifier in precision, recall, and F1 score.
2. **Expert Model Integration**: The system integrates multiple specialized audio processing models, such as ASR, Speaker Diarization, Music Identification, and Text-to-Audio generation, to handle complex audio queries.
3. **Audio Context Detection (ACD)**: The ACD module extracts audio events and their timestamps, improving the system's understanding of context. Experiments show that supplying ACD metadata in JSON format significantly improves the system's accuracy.
4. **Response Generation**: The system uses a large language model (LLM), combined with the outputs of the expert models and the chat history, to generate the final user response. The chat history is limited to the most recent 10 rounds of conversation to avoid the performance degradation caused by too many input tokens.
5. **Benchmark Testing and Evaluation**: The paper proposes two new datasets for timestamp and temporal reasoning tasks (ACD-timestamp-QA and ACD-temporal-QA) and evaluates the system on these two datasets as well as on the MMAU benchmark. The results show that the system performs well on timestamp and temporal reasoning questions.

Overall, the paper proposes a chatbot system that improves the handling of complex audio queries by integrating multiple audio processing models with a language model. According to the abstract, the system significantly improves accuracy on some custom tasks compared to state-of-the-art Large Audio Language Models and outperforms models in the 7B parameter size range on the MMAU sound test set, which makes it an attractive option for on-device deployment.
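
The ACD and response-generation points above can be sketched as a prompt-assembly step. This is a minimal illustration under stated assumptions: the event schema (`label`, `start`, `end`), the prompt layout, and the function names are hypothetical and not taken from the paper; only the JSON-formatted ACD metadata and the 10-round history limit come from the summary.

```python
import json
from typing import Dict, List, Tuple

MAX_HISTORY_ROUNDS = 10  # from the summary: keep only the most recent 10 rounds

Round = Tuple[str, str]  # (user message, assistant reply)


def truncate_history(history: List[Round], max_rounds: int = MAX_HISTORY_ROUNDS) -> List[Round]:
    """Return the most recent `max_rounds` rounds of conversation."""
    return history[-max_rounds:] if max_rounds > 0 else []


def build_prompt(
    query: str,
    acd_events: List[Dict[str, object]],
    expert_output: str,
    history: List[Round],
) -> str:
    """Assemble the text prompt passed to the LLM (hypothetical layout)."""
    # ACD metadata is serialized as JSON; the schema here is an assumption.
    acd_json = json.dumps({"audio_events": acd_events}, indent=2)
    lines = ["Audio context (ACD metadata, JSON):", acd_json]
    if expert_output:
        lines += ["Expert model output:", expert_output]
    for user_turn, assistant_turn in truncate_history(history):
        lines += [f"User: {user_turn}", f"Assistant: {assistant_turn}"]
    lines += [f"User: {query}", "Assistant:"]
    return "\n".join(lines)


if __name__ == "__main__":
    events = [
        {"label": "dog_bark", "start": 1.2, "end": 2.0},
        {"label": "door_slam", "start": 4.5, "end": 4.8},
    ]
    past = [(f"q{i}", f"a{i}") for i in range(12)]  # 12 rounds; only 10 are kept
    print(build_prompt("What happened after the dog barked?", events, "", past))
```

Serializing events with explicit start and end times gives the LLM structured input for timestamp and temporal reasoning questions, while the history cap keeps the prompt length bounded.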

Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding
