Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter that extends LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate data scarcity in the audio domain, we propose a multi-task learning strategy that formulates diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence. This improved framework imposes zero constraints on the input format and is thus capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the expert models (i.e., the networks trained on the targeted datasets) across various tasks. We finally demonstrate APT's ability to extend frozen VLMs to the audio domain without finetuning, achieving promising results in the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
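The interleaved audio-text input described in the abstract can be illustrated with a toy sketch. This is not the released APT implementation: `embed_text`, `audio_soft_prompts`, `build_input_sequence`, and the width `DIM` are hypothetical names, and the "aligner" here is a simple mean-pooling stand-in for the learned instruction-aware module. The sketch only shows how soft prompts conditioned on both audio and text might be placed in sequence with text embeddings, in the order the segments appear.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Segment = Tuple[str, object]  # ("text", str) or ("audio", list of frame vectors)

DIM = 4  # toy embedding width (assumption, not taken from the paper)


def embed_text(text: str) -> List[Vector]:
    """Toy stand-in for a frozen LLM embedding table: one vector per token."""
    return [[float(len(tok))] * DIM for tok in text.split()]


def audio_soft_prompts(frames: Sequence[Vector],
                       instruction: Sequence[Vector],
                       num_prompts: int = 2) -> List[Vector]:
    """Toy stand-in for an instruction-aware aligner.

    Pools the audio frames, shifts the result by the mean instruction
    embedding, and emits a fixed number of soft-prompt vectors.
    """
    if not frames or not instruction:
        raise ValueError("need at least one audio frame and one instruction token")
    audio_mean = [sum(col) / len(frames) for col in zip(*frames)]
    text_mean = [sum(col) / len(instruction) for col in zip(*instruction)]
    return [[a + t + k for a, t in zip(audio_mean, text_mean)]
            for k in range(num_prompts)]


def build_input_sequence(segments: Sequence[Segment],
                         instruction: str) -> List[Vector]:
    """Concatenate text embeddings and audio soft prompts in input order."""
    instr_emb = embed_text(instruction)
    sequence: List[Vector] = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(embed_text(payload))
        elif kind == "audio":
            sequence.extend(audio_soft_prompts(payload, instr_emb))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return sequence


if __name__ == "__main__":
    clip_a = [[0.0] * DIM, [2.0] * DIM]  # two toy audio frames
    clip_b = [[4.0] * DIM]               # one toy audio frame
    segments = [("audio", clip_a), ("text", "versus"),
                ("audio", clip_b), ("text", "which is louder")]
    seq = build_input_sequence(segments, "which is louder")
    print(len(seq), "vectors of width", len(seq[0]))
```

Because segments are consumed in order, any number of clips can appear anywhere in the prompt, which is the property that makes two-clip comparison and few-shot formats expressible in this sketch.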

Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

Real-time Caller Intent Detection In Human-Human Customer Support Spoken Conversations

Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages

A Streaming End-to-End Framework for Spoken Language Understanding

Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer

Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users using Intermediate ASR Features and Human Memory Models

Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification

Multilingual and Cross-Lingual Intent Detection from Spoken Data

Deliberation Model Based Two-Pass End-to-End Speech Recognition

Generalized zero-shot audio-to-intent classification

Using Automatic Speech Recognition to Measure the Intelligibility of Speech Synthesized from Brain Signals

Streaming End-to-End Bilingual ASR Systems with Joint Language Identification

Streaming Align-Refine for Non-autoregressive Deliberation

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

Analyzing And Improving Neural Speaker Embeddings for ASR

Towards Better Understanding of Spontaneous Conversations: Overcoming Automatic Speech Recognition Errors With Intent Recognition

Bidirectional RNN for Audio Deep Learning in an End-to-End Model

Calibrate and Refine! A Novel and Agile Framework for ASR-error Robust Intent Detection

Improving RNN-Transducers with Acoustic LookAhead

Building Accurate Low Latency ASR for Streaming Voice Search

Improving Speech Recognition for African American English With Audio Classification