Abstract: Foley is a term commonly used in filmmaking, referring to the addition of everyday sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), as a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges encompass maintaining content consistency between the input video and the generated audio, as well as aligning the temporal and loudness properties of the audio with the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and the target video, we introduce the Mask-Attention Module (MAM), which employs masked video instruction to enable the model to focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure the synthesis of sound that aligns with the video in both loudness and temporal dimensions. Furthermore, we have extended a large-scale V2A dataset, named VGGSound-Caption, by annotating caption prompts. Extensive experiments on challenging benchmarks across two large-scale V2A datasets verify that Draw an Audio achieves state-of-the-art performance. Project page: https://yannqi.github.io/Draw-an-Audio/

What problem does this paper attempt to address?

The paper addresses three main challenges in video-to-audio (V2A) synthesis: content consistency, temporal consistency, and loudness consistency. Specifically:

1. **Content Consistency**: The generated audio needs to be consistent with the semantic content of the input video. For example, dog barks should not occur in a video that shows only cats.
2. **Temporal Consistency**: The generated audio needs to be synchronized with the video in time, so that audio events occur simultaneously with the corresponding visual events.
3. **Loudness Consistency**: The generated audio needs to match the video in loudness, since humans are sensitive to loudness changes. For example, the sound of an elephant's footsteps should gradually grow louder as the elephant approaches.

To solve these problems, the authors propose a controllable video-to-audio synthesis model named **Draw an Audio**, which supports multiple instruction inputs such as drawn masks and loudness signals. The main contributions of this model are:

- **Mask-Attention Module (MAM)**: By introducing a drawn video mask, the model can focus on the regions of interest, which improves the consistency between the generated audio and the video content.
- **Time-Loudness Module (TLM)**: By introducing an auxiliary loudness signal, the module ensures that the generated audio is aligned with the video in both the time and loudness dimensions.

In addition, the authors extend a large-scale V2A dataset, named **VGGSound-Caption**, and verify the superior performance of **Draw an Audio** on two large-scale V2A datasets through extensive experiments.

### Formula Summary

1. **Forward Diffusion Process**:

   \[ q(z_t|z_{t-1})=\sqrt{1-\beta_t}\,z_{t-1}+\sqrt{\beta_t}\,\epsilon_t \]

   \[ q(z_t|z_0)=\sqrt{\alpha_t}\,z_0+\sqrt{1-\alpha_t}\,\epsilon_t \]

   where \(\epsilon\sim\mathcal{N}(0, I)\) and \(\alpha_t = \prod_{i=1}^{t}(1-\beta_i)\).

2. **Loss Function of the Backward Diffusion Process**:

   \[ L_{LDM}=\sum_{t=1}^{T}\mathbb{E}_{\epsilon_t\sim\mathcal{N}(0, I),\, z_0}\left[\|\epsilon_t-\hat{\epsilon}_\theta(z_t,\tau)\|_2^2\right] \]

3. **RMS Energy Calculation**:

   \[ F_{rms}(i)=\sqrt{\frac{1}{N_{win}}\sum_{n=i}^{i+N_{win}-1}\left(x_a(n)\right)^2} \]

   where \(N_{win}\) is the window size and \(N_{hop}\) is the hop length.

4. **Generation of Hand-drawn Loudness Signals**:

   \[ F'_{rms}=\mathrm{AAP}(F_{rms}) \]

   \[ F_{signal}(t)=\sum_{i=-N'_{win}/2}^{N'_{win}/2}F'_{rms}(t-i)\cdot w'(i) \]

   where \(w'(i)\) is the normalized Gaussian kernel, \(\sigma\) is the variance, and \(N'_{win}\) is the window size.

Through these methods, **Draw an Audio** achieves high-quality video-to-audio synthesis and performs strongly on multiple benchmarks.
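The formulas in the summary above can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, AAP is assumed to mean adaptive average pooling, frames are assumed to start every `n_hop` samples (with `n_hop=1` this matches the RMS sum as written), and `sigma` is used as the kernel's standard deviation even though the summary calls it the variance.

```python
import numpy as np


def forward_diffuse(z0, t, betas, rng):
    """Closed-form forward step: z_t = sqrt(alpha_t) z_0 + sqrt(1 - alpha_t) eps."""
    alpha_t = np.prod(1.0 - betas[:t])  # alpha_t = prod_{i=1..t} (1 - beta_i)
    eps = rng.standard_normal(z0.shape)  # eps ~ N(0, I)
    return np.sqrt(alpha_t) * z0 + np.sqrt(1.0 - alpha_t) * eps, eps


def rms_energy(x, n_win, n_hop):
    """Frame-wise RMS. Assumption: frame i covers samples [i*n_hop, i*n_hop + n_win)."""
    n_frames = 1 + (len(x) - n_win) // n_hop
    return np.array([
        np.sqrt(np.mean(x[i * n_hop: i * n_hop + n_win] ** 2))
        for i in range(n_frames)
    ])


def adaptive_avg_pool(x, out_len):
    """Assumed meaning of AAP: average x over out_len roughly equal bins."""
    n = len(x)
    out = np.empty(out_len)
    for k in range(out_len):
        start = (k * n) // out_len            # floor(k * n / out_len)
        end = -((-(k + 1) * n) // out_len)    # ceil((k + 1) * n / out_len)
        out[k] = x[start:end].mean()
    return out


def gaussian_smooth(x, n_win, sigma):
    """F_signal(t) = sum_i F'(t - i) w'(i) for i in [-n_win/2, n_win/2]."""
    half = n_win // 2
    offsets = np.arange(-half, half + 1)
    w = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))  # sigma as std dev (assumption)
    w /= w.sum()                                      # normalized kernel w'
    padded = np.pad(x, half, mode="edge")             # edge padding keeps the length
    return np.convolve(padded, w, mode="valid")


# Usage: turn a waveform whose amplitude grows over time into a coarse,
# smooth loudness curve resembling a hand-drawn signal.
time = np.linspace(0.0, 1.0, 16000)
x_a = time * np.sin(2.0 * np.pi * 220.0 * time)
f_rms = rms_energy(x_a, n_win=512, n_hop=128)
f_signal = gaussian_smooth(adaptive_avg_pool(f_rms, out_len=32), n_win=8, sigma=2.0)
```

Pooling before smoothing discards fine temporal detail, so the resulting curve is closer to what a user could plausibly sketch by hand than the raw frame-level RMS.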

Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
