Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
2024-07-25
Abstract: Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds -- 1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
Computer Vision and Pattern Recognition, Artificial Intelligence, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the problem of generating realistic action sounds for egocentric (first-person) videos, particularly in-the-wild footage where ambient background sounds and action sounds are mixed. Its main points are:

1. **Separation of Background Sounds and Action Sounds**:
   - Existing methods implicitly assume total correspondence between video and audio during training. In practice, many sounds (such as traffic noise or conversations) happen off-screen and have weak or no correspondence with the visual content, which leads to uncontrolled ambient sounds or hallucinated sounds at test time.
   - The paper proposes an audio-conditioning mechanism that learns, during training, to disentangle foreground action sounds from ambient background sounds.
2. **Novel Audio Generation Model**:
   - The paper proposes a model named AV-LDM. Given a new silent video, it uses retrieval-augmented generation to produce audio that matches the video content both semantically and temporally.
3. **New Dataset**:
   - The paper introduces Ego4D-Sounds, a large-scale egocentric video dataset of 1.2 million curated clips with action-audio correspondence. The model is trained and evaluated on two in-the-wild egocentric datasets, Ego4D and EPIC-KITCHENS.
4. **Experimental Results**:
   - The proposed model outperforms an array of existing methods, allows controllable generation of the ambient sound, and shows promise for generalizing to computer graphics game clips.

In summary, the paper addresses the problem of separating action sounds from background sounds when generating audio for real-world videos, so that the generated audio stays faithful to the observed visual content.
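To make the retrieval-augmented generation step mentioned above more concrete, here is a minimal sketch of how a conditioning audio clip might be selected for a silent test video. It is an illustration under stated assumptions and not the paper's implementation: the summary does not say how clips are embedded or scored, so the function names, the toy embedding bank, and the use of cosine similarity are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_audio_condition(query_embedding, bank):
    """Return the id of the training clip whose embedding is closest to the query.

    `bank` is a list of (clip_id, embedding) pairs. Both the bank and the
    embeddings are hypothetical stand-ins for whatever features a real
    system would compute from training videos.
    """
    if not bank:
        raise ValueError("retrieval bank is empty")
    best_id, _ = max(
        bank, key=lambda item: cosine_similarity(query_embedding, item[1])
    )
    return best_id


# Toy bank of training clips with made-up 3-dimensional embeddings.
bank = [
    ("clip_kitchen", [0.9, 0.1, 0.0]),
    ("clip_street", [0.0, 0.2, 0.9]),
    ("clip_workshop", [0.5, 0.5, 0.1]),
]

# Made-up embedding for a silent test video.
query = [0.8, 0.2, 0.1]

retrieved = retrieve_audio_condition(query, bank)
```

In a setup like this, the audio of the retrieved clip would be passed to the generator as the conditioning signal for ambient sound, while the video features drive the action sound. Swapping in a different clip would then be one way to control the ambient sound, a capability the abstract reports without specifying the mechanism.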