Abstract: The content of visual and audio scenes is multi-faceted, such that a video can be paired with various audio and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description of the audio. Such a framework has two advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of visual information, and ii) the model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, an LLM that is fine-tuned for instructions and includes a projection layer that maps video features to the LLM vector space; and VATT Audio, a transformer that generates audio tokens from visual frames and from an optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by a pretrained neural codec. Experiments show that, on objective metrics, VATT achieves performance competitive with existing video-to-audio generation methods when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that audio generated by VATT Audio is preferred over audio generated by existing methods. VATT enables controllable video-to-audio generation through text as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
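
The abstract names iterative parallel decoding without detailing it. As a rough illustration only, the sketch below shows one common form of the idea: all audio token positions start masked, and in each round the most confident proposals are committed in parallel. The cosine schedule, the confidence-based selection, the `MASK` sentinel, and `toy_predictor` (a random stand-in for the audio transformer) are assumptions made for this example and are not taken from the paper.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a position that is not decoded yet


def toy_predictor(tokens, vocab_size, rng):
    """Random stand-in for the audio token transformer (hypothetical).

    Returns one (token, confidence) proposal per position. A real model
    would condition on video features and the optional text prompt.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_parallel_decode(length, vocab_size, steps, seed=0):
    """Fill `length` masked positions in `steps` rounds (assumed cosine schedule)."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        proposals = toy_predictor(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Number of positions that should stay masked after this round.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        # Commit at least one position per round while any remain masked.
        n_commit = max(1, len(masked) - keep_masked) if masked else 0
        # Commit the most confident proposals among the masked positions.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_parallel_decode(length=16, vocab_size=1024, steps=4))
```

In a full pipeline the resulting token sequence would then be passed to a neural codec decoder to obtain the waveform; that stage is omitted here.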

VarietySound: Timbre-Controllable Video to Sound Generation Via Unsupervised Information Disentanglement

Visual to Sound: Generating Natural Sound for Videos in the Wild

Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation

Video-to-Audio Generation with Hidden Alignment

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

Self-Supervised Audio-Visual Soundscape Stylization

Diverse and Vivid Sound Generation from Text Descriptions

Diffsound: Discrete Diffusion Model for Text-to-Sound Generation

Gotta Hear Them All: Sound Source Aware Vision to Audio Generation

Tell What You Hear From What You See -- Video to Audio Generation Through Text

Synthesizing Audio from Silent Video using Sequence to Sequence Modeling

Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks

Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Conditional Generation of Audio from Video via Foley Analogies

Video Echoed in Harmony: Learning and Sampling Video-Integrated Chord Progression Sequences for Controllable Video Background Music Generation

Video-to-Audio Generation with Fine-grained Temporal Semantics

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

Read, Watch and Scream! Sound Generation from Text and Video

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

SonicVisionLM: Playing Sound with Vision Language Models

Align, Adapt and Inject: Sound-guided Unified Image Generation