FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen

2024-07-02

Abstract: We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e., semantically relevant and temporally synchronized) sounds. To overcome these limitations, we propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation. FoleyCrafter comprises two key components: the semantic adapter for semantic alignment and the temporal controller for precise audio-video synchronization. The semantic adapter utilizes parallel cross-attention layers to condition audio generation on video features, producing realistic sound effects that are semantically relevant to the visual content. Meanwhile, the temporal controller incorporates an onset detector and a timestamp-based adapter to achieve precise audio-video alignment. One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents. We conduct extensive quantitative and qualitative experiments on standard benchmarks to verify the effectiveness of FoleyCrafter. Models and code are available at https://github.com/open-mmlab/FoleyCrafter.

Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses Neural Foley: automatically generating high-quality sound effects that are synchronized with silent videos. Specifically:

1. **High-Quality Sound Generation**: Existing methods struggle to produce sounds that are both high-quality and aligned with the video. The paper proposes a new framework, FoleyCrafter, built on a pre-trained text-to-audio model to ensure high-quality audio generation.
2. **Semantic Alignment and Temporal Synchronization**: FoleyCrafter includes two key components:
   - **Semantic Adapter**: Uses parallel cross-attention layers to condition audio generation on video features, producing realistic sound effects that are semantically relevant to the visual content.
   - **Temporal Controller**: Comprises an onset detector and a timestamp-based adapter to achieve precise audio-video synchronization.
3. **Controllability**: FoleyCrafter is compatible with text prompts, so users can steer video-to-audio generation with text descriptions and obtain controllable, diverse results that match their intent.

The authors validate FoleyCrafter through extensive quantitative and qualitative experiments, reporting state-of-the-art performance on standard benchmarks.
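The two components described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the parallel cross-attention simply adds a video-conditioned attention output to the text-conditioned one (weighted by a hypothetical `video_scale`), it uses the raw features as both keys and values instead of learned projections, and it reduces the onset detector's output to a thresholded per-frame mask. All function and parameter names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def parallel_cross_attention(audio_q, text_feats, video_feats, video_scale=1.0):
    """Sum of a text branch and a video branch (assumed combination rule).

    Simplification: features act as both keys and values; a real model
    would apply separate learned projections in each branch.
    """
    text_out = attend(audio_q, text_feats, text_feats)
    video_out = attend(audio_q, video_feats, video_feats)
    return [
        [t + video_scale * v for t, v in zip(t_row, v_row)]
        for t_row, v_row in zip(text_out, video_out)
    ]


def onset_mask(frame_probs, threshold=0.5):
    """Turn per-frame onset probabilities into a binary timestamp mask."""
    return [1 if p >= threshold else 0 for p in frame_probs]


if __name__ == "__main__":
    fused = parallel_cross_attention(
        audio_q=[[1.0, 0.0]],
        text_feats=[[1.0, 2.0]],
        video_feats=[[3.0, 4.0]],
        video_scale=0.5,
    )
    print(fused)
    print(onset_mask([0.1, 0.9, 0.5, 0.2]))
```

Setting `video_scale` to 0 recovers the text-only branch, which is one way to see why this kind of design can stay compatible with text prompts: the video branch is added alongside the text conditioning instead of replacing it.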

Video-Guided Foley Sound Generation with Multimodal Controls

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis

FastFoley: Non-autoregressive Foley Sound Generation Based on Visual Semantics

AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent Videos With Deep Learning

FoleyGen: Visually-Guided Audio Generation

Conditional Generation of Audio from Video via Foley Analogies

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis

T-FOLEY: A Controllable Waveform-Domain Diffusion Model for Temporal-Event-Guided Foley Sound Synthesis

FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos

Exploring Domain-Specific Enhancements for a Neural Foley Synthesizer

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Text-Driven Foley Sound Generation With Latent Diffusion Model

Foley Music: Learning to Generate Music from Videos

Video-to-Audio Generation with Hidden Alignment

Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model

Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7