Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter that extends LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate data scarcity in the audio domain, a multi-task learning strategy is proposed that formulates diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence. This improved framework imposes no constraints on the input format and is thus capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses two audio clips through comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to expert models (i.e., networks trained on the targeted datasets) across various tasks. Finally, we demonstrate APT's ability to extend frozen VLMs to the audio domain without finetuning, achieving promising results on the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.

WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models

Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

TransPrompt V2: Transferable Prompt-based Fine-tuning for Few-shot Text Classification

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

WhisPAr: Transferring Pre-trained Audio Models to Fine-grained Classification Via Prompt and Adapter

Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

HybridPrompt: Bridging Language Models and Human Priors in Prompt Tuning for Visual Question Answering

TransPrompt - Towards an Automatic Transferable Prompting Framework for Few-shot Text Classification

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Evolutionary Verbalizer Search for Prompt-based Few Shot Text Classification

PromptST: Abstract Prompt Learning for End-to-End Speech Translation

Eliciting Knowledge from Pretrained Language Models for Prototypical Prompt Verbalizer

Unified Prompt Learning Makes Pre-Trained Language Models Better Few-Shot Learners

A Unified Framework for Multi-intent Spoken Language Understanding with prompting

VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval

TransPrompt v2: A Transferable Prompting Framework for Cross-task Text Classification

Prompting Visual-Language Models for Efficient Video Understanding

Audio-free Prompt Tuning for Language-Audio Models

Open-vocabulary Auditory Neural Decoding Using fMRI-prompted LLM