Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter that extends LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate data scarcity in the audio domain, we propose a multi-task learning strategy that formulates diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the audio language model framework by using interleaved audio-text embeddings as the input sequence. This improved framework imposes no constraints on the input format and is thus capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to expert models (i.e., networks trained on the targeted datasets) across various tasks. Finally, we demonstrate APT's ability to extend frozen VLMs to the audio domain without finetuning, achieving promising results on the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
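The interleaved audio-text input described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `embed_text`, `audio_soft_prompts`, `build_input_sequence`, the embedding width `DIM`, and the prompt count `NUM_PROMPTS` are all hypothetical stand-ins, and the arithmetic inside them is placeholder logic chosen only so the example runs without any dependencies. It shows just the data flow, assuming each audio clip is replaced by a fixed number of soft-prompt vectors that depend on both the clip and the accompanying text.

```python
from typing import List, Tuple, Union

DIM = 4          # toy embedding width (assumption, not from the paper)
NUM_PROMPTS = 2  # soft prompts emitted per audio clip (assumption)

Vector = List[float]
Segment = Tuple[str, Union[str, List[float]]]


def embed_text(text: str) -> List[Vector]:
    """Hypothetical stand-in for a frozen LLM embedding table: one vector per word."""
    vectors = []
    for word in text.split():
        seed = sum(ord(ch) for ch in word)
        vectors.append([((seed * (i + 1)) % 97) / 97.0 for i in range(DIM)])
    return vectors


def audio_soft_prompts(audio: List[float], instruction: str) -> List[Vector]:
    """Hypothetical stand-in for an instruction-aware aligner.

    The output depends on both the audio features and the instruction text.
    """
    audio_mean = sum(audio) / len(audio) if audio else 0.0
    text_vecs = embed_text(instruction)
    if text_vecs:
        # Mean-pool the instruction embeddings into a single vector.
        text_mean = [sum(v[i] for v in text_vecs) / len(text_vecs) for i in range(DIM)]
    else:
        text_mean = [0.0] * DIM
    return [
        [audio_mean * (k + 1) + text_mean[i] for i in range(DIM)]
        for k in range(NUM_PROMPTS)
    ]


def build_input_sequence(segments: List[Segment]) -> List[Vector]:
    """Concatenate text embeddings and audio soft prompts in their original order."""
    instruction = " ".join(p for kind, p in segments if kind == "text")
    sequence: List[Vector] = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(embed_text(payload))
        elif kind == "audio":
            sequence.extend(audio_soft_prompts(payload, instruction))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


# Two clips separated by a question, as in a comparison-style query.
segments = [
    ("audio", [0.1, 0.2]),
    ("text", "Which clip is louder"),
    ("audio", [0.5, 0.7]),
]
sequence = build_input_sequence(segments)
```

Because the sequence is built segment by segment, any number of clips and text spans can appear in any order, which is the property the abstract attributes to the interleaved format. In a real system the resulting vectors would be passed to the language model as input embeddings, with the language model itself left frozen.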

LLM-AD: Large Language Model based Audio Description System

Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

AutoAD: Movie Description in Context

AutoAD III: The Prequel -- Back to the Pixels

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Contextual AD Narration with Interleaved Multimodal Sequence

Making Accessible Movies Easily: an Intelligent Tool for Authoring and Integrating Audio Descriptions to Movies

Audio-Visual LLM for Video Understanding

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

Toward Automatic Audio Description Generation for Accessible Videos

Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

DistinctAD: Distinctive Audio Description Generation in Contexts

Movie Description

C3LLM: Conditional Multimodal Content Generation Using Large Language Models

Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model