Abstract: The content of visual and audio scenes is multi-faceted, such that a video can be paired with various audio and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description of the audio. Such a framework has two advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of visual information, and ii) the model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, an LLM that is fine-tuned for instructions and includes a projection layer that maps video features to the LLM vector space; and VATT Audio, a transformer that generates audio tokens from visual frames and from an optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by a pretrained neural codec. Experiments show that, compared to existing video-to-audio generation methods on objective metrics, VATT achieves competitive performance when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that audio generated by VATT Audio is preferred over audio generated by existing methods. VATT enables controllable video-to-audio generation through text as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
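
To make the role of the projection layer concrete, here is a minimal sketch of mapping per-frame video features into an LLM's embedding space and joining them with the embeddings of an optional text prompt. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dimensions, the random weights standing in for learned parameters, and the choice to place the video tokens before the text tokens are all hypothetical.

```python
import random

def make_projection(d_video, d_llm, seed=0):
    """Random (d_llm x d_video) weights and a zero bias; stand-ins for learned parameters."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_video)] for _ in range(d_llm)]
    bias = [0.0] * d_llm
    return weight, bias

def project(frame_features, weight, bias):
    """Apply y = W x + b to every per-frame feature vector."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in frame_features
    ]

def build_llm_input(frame_features, text_embeddings, weight, bias):
    """Concatenate projected video tokens with the (optional) text prompt embeddings.

    Assumption: video tokens come first; the abstract does not specify the order.
    """
    return project(frame_features, weight, bias) + list(text_embeddings)

# Toy sizes (hypothetical): 3 frames of 4-dim features, a 6-dim LLM space, 2 text tokens.
weight, bias = make_projection(d_video=4, d_llm=6)
frames = [[0.1 * i] * 4 for i in range(3)]
text = [[0.0] * 6 for _ in range(2)]
sequence = build_llm_input(frames, text, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 6
```

Passing an empty list for `text_embeddings` corresponds to the case where no text prompt is given, so the sequence contains only the projected video tokens.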

What problem does this paper attempt to address?

This paper addresses two main problems in the video-to-audio generation task:

1. **Lack of controllability**: Although existing video-to-audio generation methods can generate audio related to the video content, they lack fine-grained control over the generated audio. For example, for a video showing two cats fighting over territory, the model may generate gentle, friendly meows that do not match the tense atmosphere of the scene. This mismatch mainly stems from the visual encoder's limited ability to distinguish the sound properties of the same sound source in different situations.
2. **Failure to fully utilize text information**: Existing methods usually rely only on video frames for audio generation and ignore the supplementary role of text. Text can provide additional context and explanation that helps generate audio closer to what is expected. However, pure text-to-audio generation models often cannot incorporate visual information well, resulting in audio that does not match the video temporally or semantically.

To solve these problems, the authors propose a new framework, **VATT (Video-to-Audio Through Text)**, which can generate audio in a text-guided manner and, when no text prompt is given, can automatically generate audio descriptions (i.e., audio captions). The advantages of VATT are:

- **Refining and controlling the generated audio through text**: Text prompts complement the video information, making the generated audio more in line with users' expectations.
- **Generating audio captions**: The model can suggest what kind of audio to generate, thereby providing a reasonable audio description or classification for the video.

Specifically, VATT contains two key modules:

- **VATT Converter**: A fine-tuned large language model (LLM) with a projection layer that maps video features to the LLM vector space.
- **VATT Audio**: A bidirectional transformer decoder that generates audio tokens from visual frames and optional text prompts using iterative parallel decoding. A pretrained neural codec then converts the audio tokens into a waveform.

With this design, VATT not only performs well on objective metrics (such as the KLD score on the VGGSound dataset) but also obtains a higher user preference in subjective evaluations. In addition, VATT is an order of magnitude faster in generation speed than existing methods.

In summary, the main contributions of the VATT framework include:

- Proposing the first framework that can simultaneously achieve text-guided video-to-audio generation and video-to-audio caption generation.
- Creating a large-scale synthetic audio caption dataset for training and text-conditioned generation.
- Achieving state-of-the-art video-to-audio generation performance on multiple benchmark datasets and significantly improving controllability and efficiency.
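
The iterative parallel decoding used by VATT Audio can be sketched as follows. This is a toy illustration, not the paper's implementation: `toy_predictor` is a hypothetical stand-in for the audio token transformer (it returns random guesses and ignores the video and text conditioning), and the cosine unmasking schedule and confidence-based commit order are assumptions borrowed from common masked-token decoders, since the summary does not specify them.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet

def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical model stand-in: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

def iterative_parallel_decode(length, vocab_size, steps, seed=0):
    """Fill a fully masked sequence over a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Predict every position in parallel.
        guesses = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: number of positions left masked after this step.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * step / steps))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident guesses; the rest stay masked for later steps.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        n_commit = max(len(masked) - keep_masked, 0)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens

audio_tokens = iterative_parallel_decode(length=16, vocab_size=1024, steps=4)
print(audio_tokens.count(MASK))  # 0
```

Because many tokens are committed per step, the number of model calls equals `steps` rather than the sequence length, which is the usual reason this style of decoding is faster than token-by-token generation.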

Tell What You Hear From What You See -- Video to Audio Generation Through Text
