Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding

Vakada Naveen, Arvind Krishna Sridhar, Yinyi Guo, Erik Visser
2024-12-05
Abstract: This paper presents a comprehensive chatbot system designed to handle a wide range of audio-related queries by integrating multiple specialized audio processing models. The proposed system uses an intent classifier, trained on a diverse audio query dataset, to route queries about audio content to expert models such as Automatic Speech Recognition (ASR), Speaker Diarization, Music Identification, and Text-to-Audio generation. A 3.8B-parameter LLM then takes input from an Audio Context Detection (ACD) module, which extracts audio event information from the audio, and post-processes the text-domain outputs of the expert models to compute the final response to the user. We evaluated the system on custom audio tasks and on the MMAU sound set benchmark. The custom datasets were motivated by target use cases not covered in industry benchmarks and included the ACD-timestamp-QA (Question Answering) and ACD-temporal-QA datasets, which evaluate timestamp and temporal reasoning questions, respectively. First, we determined that a BERT-based intent classifier outperforms an LLM few-shot intent classifier in routing queries. Experiments further show that our approach significantly improves accuracy on some custom tasks compared to state-of-the-art Large Audio Language Models and outperforms models in the 7B parameter size range on the sound test set of the MMAU benchmark, thereby offering an attractive option for on-device deployment.
Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the limitations of current chatbot systems in handling queries about audio content. Existing chatbots can usually handle only specific audio tasks, such as speech recognition or music recommendation, and cannot cover multiple types of audio queries in one system. They also perform poorly on complex audio-related questions, especially when several audio processing models need to be combined.
To address these challenges, the paper proposes a chatbot system that handles a wide range of audio-related queries by integrating multiple specialized audio processing models: Automatic Speech Recognition (ASR), Speaker Diarization, Music Identification, and Text-to-Audio generation. Its core components are an intent classifier, which routes user queries to the corresponding expert models, and a language model, which generates coherent, context-aware responses. An Audio Context Detection (ACD) module additionally supplies audio event information. The key contributions are:
1. **Intent Classifier**: A BERT-based intent classifier is trained to route audio-related text queries to the appropriate audio processing models. Experimental results show that it outperforms the LLM few-shot intent classifier in precision, recall, and F1 score.
2. **Expert Model Integration**: The system integrates multiple specialized audio processing models, such as ASR, Speaker Diarization, Music Identification, and Text-to-Audio generation, to handle complex audio queries.
3. **Audio Context Detection (ACD)**: The ACD module extracts audio events and their timestamps, improving the system's contextual understanding. Experiments show that passing ACD metadata in JSON format significantly improves the system's accuracy.
4. **Response Generation**: A large language model (LLM) combines the outputs of the expert models with the chat history to generate the final user response. The chat history is limited to the 10 most recent rounds of conversation to avoid the performance degradation caused by too many input tokens.
5. **Benchmarking and Evaluation**: The paper introduces two new datasets for time-related reasoning, ACD-timestamp-QA and ACD-temporal-QA, and evaluates the system on both as well as on the MMAU benchmark. The results show that the system performs well on timestamp and temporal reasoning questions.
Overall, the paper proposes a chatbot system that improves the handling of complex audio queries by combining multiple audio processing models with a language model. According to the abstract, it significantly improves accuracy on some custom tasks compared to state-of-the-art Large Audio Language Models and outperforms models in the 7B parameter size range on the MMAU sound test set, which makes it an attractive option for on-device deployment.
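The flow described in the points above (route the query, gather expert output, attach ACD metadata as JSON, cap the chat history at 10 rounds) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the keyword rules merely stand in for the trained BERT classifier, the expert functions are placeholders, and the names `classify_intent`, `build_prompt`, `AudioChatSession`, and the ACD event fields (`event`, `start`, `end`) are all assumptions made for the example.

```python
import json
from collections import deque

MAX_ROUNDS = 10  # the summary states history is capped at the 10 most recent rounds

# Hypothetical placeholders for the expert models; each returns text-domain output.
EXPERTS = {
    "asr": lambda audio: "transcript placeholder",
    "diarization": lambda audio: "speaker segments placeholder",
    "music_id": lambda audio: "track title placeholder",
    "text_to_audio": lambda audio: "generated audio placeholder",
}

# Toy keyword rules standing in for the trained BERT intent classifier (assumption).
KEYWORDS = {
    "asr": ("transcribe", "transcript", "what was said"),
    "diarization": ("who spoke", "speakers", "how many people"),
    "music_id": ("song", "music", "track"),
    "text_to_audio": ("generate", "create a sound", "synthesize"),
}


def classify_intent(query):
    """Return an expert name, or None when no expert model is needed."""
    q = query.lower()
    for intent, words in KEYWORDS.items():
        if any(w in q for w in words):
            return intent
    return None


def build_prompt(query, acd_events, expert_output, history):
    """Assemble the text prompt that a response LLM would receive."""
    turns = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history)
    return (
        f"Audio events (JSON): {json.dumps(acd_events)}\n"
        f"Expert output: {expert_output or 'none'}\n"
        f"Chat history:\n{turns}\n"
        f"User: {query}"
    )


class AudioChatSession:
    def __init__(self):
        # deque(maxlen=...) silently drops the oldest round once the cap is reached.
        self.history = deque(maxlen=MAX_ROUNDS)

    def prepare(self, query, audio, acd_events):
        """Route the query and return (intent, prompt) for the response LLM."""
        intent = classify_intent(query)
        expert_output = EXPERTS[intent](audio) if intent else None
        return intent, build_prompt(query, acd_events, expert_output, self.history)

    def record(self, query, response):
        """Store one completed round of conversation."""
        self.history.append((query, response))
```

In this sketch a query that matches no expert still receives the ACD events in its prompt, which reflects the idea that timestamp and temporal questions can be answered from the event metadata alone. Whether the actual system handles such queries this way is not stated in the text above.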