Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter that extends LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate data scarcity in the audio domain, we propose a multi-task learning strategy that formulates diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence. This improved framework imposes no constraints on the input format and is thus capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that requires analysing two audio clips through comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to expert models (i.e., networks trained on the targeted datasets) across various tasks. Finally, we demonstrate APT's ability to extend frozen VLMs to the audio domain without finetuning, achieving promising results on the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
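
To make the idea of interleaved audio-text embeddings concrete, the following is a minimal sketch and not the authors' implementation. It assumes a text embedder that returns one vector per token and an audio aligner that returns a fixed number of soft-prompt vectors per clip. The names `interleave_embeddings`, `toy_embed_text`, `toy_audio_to_prompts`, `DIM`, and `NUM_PROMPTS` are hypothetical, and the toy aligner is not instruction-aware, unlike the aligner described in the abstract.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# A segment is ("text", token_ids) or ("audio", feature_frames).
Segment = Tuple[str, list]

DIM = 4          # hypothetical embedding width
NUM_PROMPTS = 2  # hypothetical number of soft prompts per audio clip


def interleave_embeddings(
    segments: Sequence[Segment],
    embed_text: Callable[[list], List[Vector]],
    audio_to_prompts: Callable[[list], List[Vector]],
) -> List[Vector]:
    """Concatenate text embeddings and audio soft prompts in input order."""
    sequence: List[Vector] = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(embed_text(payload))
        elif kind == "audio":
            sequence.extend(audio_to_prompts(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return sequence


def toy_embed_text(token_ids: list) -> List[Vector]:
    # Stand-in for an embedding table: one constant vector per token id.
    return [[float(t)] * DIM for t in token_ids]


def toy_audio_to_prompts(frames: list) -> List[Vector]:
    # Stand-in for an audio aligner: mean-pool the frames, then repeat the
    # pooled vector NUM_PROMPTS times as "soft prompts".
    n = len(frames)
    mean = [sum(f[i] for f in frames) / n for i in range(DIM)]
    return [list(mean) for _ in range(NUM_PROMPTS)]


# Two audio clips interleaved with text, in arbitrary order.
segments = [
    ("text", [1, 2]),
    ("audio", [[0.0] * DIM, [2.0] * DIM]),
    ("text", [3]),
    ("audio", [[4.0] * DIM]),
]
sequence = interleave_embeddings(segments, toy_embed_text, toy_audio_to_prompts)
print(len(sequence))  # 2 text + 2 prompts + 1 text + 2 prompts
```

Because the function only walks the segments in order, any mix of text and audio is accepted, which is the property the abstract attributes to the interleaved input format. In a real system the resulting sequence would be passed to a frozen language model as input embeddings.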

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Real-Time Multimodal Turn-taking Prediction to Enhance Cooperative Dialogue during Human-Agent Interaction

Gated Multimodal Fusion with Contrastive Learning for Turn-taking Prediction in Human-robot Dialogue

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

Acoustic Model Fusion for End-to-end Speech Recognition

Using Large Language Model for End-to-End Chinese ASR and NER

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History

Multimodal and Multitask Approach to Listener's Backchannel Prediction: Can Prediction of Turn-changing and Turn-management Willingness Improve Backchannel Modeling?

IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities

Integrated Method of Deep Learning and Large Language Model in Speech Recognition

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

A Survey on Speech Large Language Models

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

Language Model Can Listen While Speaking

Multilingual Turn-taking Prediction Using Voice Activity Projection

A Novel LSTM-Based Speech Preprocessor for Speaker Diarization in Realistic Mismatch Conditions