Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter that extends LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate data scarcity in the audio domain, we propose a multi-task learning strategy that formulates diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence. This improved framework imposes no constraints on the input format and is thus capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to expert models (i.e., networks trained on the targeted datasets) across various tasks. Finally, we demonstrate APT's ability to extend frozen VLMs to the audio domain without finetuning, achieving promising results on the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
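The interleaved audio-text input described in the abstract can be illustrated with a toy sketch. This is not the released APT implementation: the names `toy_audio_aligner` and `build_interleaved_input`, the embedding width, and the number of soft prompts per clip are all hypothetical, and the aligner here is a plain average-pooling stand-in that, unlike APT's instruction-aware aligner, ignores the text.

```python
EMBED_DIM = 4     # toy embedding width (assumption, not from the paper)
NUM_PROMPTS = 2   # soft-prompt vectors produced per audio clip (assumption)


def embed_text(tokens, table):
    """Look up one embedding vector per text token in a toy table."""
    return [table[tok] for tok in tokens]


def toy_audio_aligner(frames, num_prompts=NUM_PROMPTS):
    """Hypothetical stand-in for an audio aligner.

    Average-pools consecutive chunks of frame features into a fixed
    number of soft-prompt vectors. Trailing frames that do not fill a
    chunk are dropped in this toy version.
    """
    chunk = max(1, len(frames) // num_prompts)
    prompts = []
    for i in range(num_prompts):
        # Fall back to the last frame if the clip is shorter than num_prompts.
        part = frames[i * chunk:(i + 1) * chunk] or frames[-1:]
        prompts.append([sum(col) / len(part) for col in zip(*part)])
    return prompts


def build_interleaved_input(segments, table):
    """Concatenate text embeddings and audio soft prompts in input order.

    `segments` is a list of ("text", tokens) or ("audio", frames) pairs,
    so any number of clips can appear anywhere in the sequence.
    """
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(embed_text(payload, table))
        elif kind == "audio":
            sequence.extend(toy_audio_aligner(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence
```

Because each segment is handled independently and appended in order, the same routine covers a single clip with a question, several labelled clips for few-shot classification, or two clips to be compared, which is the flexibility the abstract attributes to the interleaved format.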

WhisPAr: Transferring Pre-trained Audio Models to Fine-grained Classification Via Prompt and Adapter

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

On the Transferability of Whisper-based Representations for "In-the-Wild" Cross-Task Downstream Speech Applications

Audio-free Prompt Tuning for Language-Audio Models

TransPrompt V2: Transferable Prompt-based Fine-tuning for Few-shot Text Classification

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

When Vision Models Meet Parameter Efficient Look-Aside Adapters Without Large-Scale Audio Pretraining

Whispy: Adapting STT Whisper Models to Real-Time Environments

Transfer Learning and Bias Correction with Pre-trained Audio Embeddings

Adapting Language-Audio Models as Few-Shot Audio Learners

TransPrompt v2: A Transferable Prompting Framework for Cross-task Text Classification

Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Audio Prompt Adapter: Unleashing Music Editing Abilities for Text-to-Music with Lightweight Finetuning

AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

Transfer Learning for Speech and Language Processing