Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter that extends LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate data scarcity in the audio domain, we propose a multi-task learning strategy that formulates diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence. This improved framework imposes zero constraints on the input format and is thus capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the expert models (i.e., the networks trained on the targeted datasets) across various tasks. Finally, we demonstrate APT's ability to extend frozen VLMs to the audio domain without finetuning, achieving promising results in the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.

AND: Audio Network Dissection for Interpreting Deep Acoustic Models

Visualizing and Understanding Neural Models in NLP

Toward a Better Understanding of Deep Neural Network Based Acoustic Modelling: An Empirical Investigation

Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models

Towards audio language modeling -- an overview

Fine-grained Artificial Neurons in Audio-transformers for Disentangling Neural Auditory Encoding.

Audio-Visual Model Distillation Using Acoustic Images

Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems

Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization

Dissecting neural computations in the human auditory pathway using deep neural networks for speech

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

Dissecting neural computations of the human auditory pathway using deep neural networks for speech

AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark

Building DNN acoustic models for large vocabulary speech recognition

A Survey of the Interpretability Aspect of Deep Learning Models

Recent Progresses in Deep Learning based Acoustic Models (Updated)

Look, Listen and Infer.

Deep Sensory Substitution: Noninvasively Enabling Biological Neural Networks to Receive Input from Artificial Neural Networks

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

Interpreting deep urban sound classification using Layer-wise Relevance Propagation