Abstract: Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds -- 1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the problem of generating realistic audio for egocentric (first-person) videos recorded in the wild, where foreground action sounds are mixed with ambient background sounds. Specifically, the paper makes the following contributions:

1. **Separation of Background Sounds and Action Sounds**:
   - Existing methods implicitly assume total correspondence between video and audio during training. In practice, many sounds (such as traffic noise or conversations) happen off-screen and have weak to no correspondence with the visuals, which leads to uncontrolled ambient sounds or hallucinations at test time.
   - The paper proposes an audio-conditioning mechanism that learns, during training, to disentangle foreground action sounds from ambient background sounds.
2. **Novel Audio Generation Model**:
   - The paper proposes AV-LDM, an ambient-aware model that, given a new silent video, uses retrieval-augmented generation to produce audio matching the visual content both semantically and temporally.
3. **New Dataset**:
   - The paper introduces Ego4D-Sounds, a set of 1.2 million curated egocentric clips with action-audio correspondence. The model is trained and evaluated on Ego4D and EPIC-KITCHENS.
4. **Experimental Results**:
   - The proposed model outperforms an array of existing methods, allows controllable generation of the ambient sound, and shows promise for generalizing to computer graphics game clips.

In summary, the paper addresses how to keep generated audio faithful to the observed visual content when training on uncurated real-world videos with natural background sounds, and it demonstrates the potential value of this approach in applications such as sound effects for films and games.

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
