Abstract: We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets. Existing work has been limited to domains like speech, music, or impact sounds and cannot easily capture the broad range of audio frequencies found in egocentric videos. EgoSonics addresses these limitations by building on the strength of latent diffusion models for conditioned audio synthesis. We first encode and process audio and video data into a form that is suitable for generation. The encoded data is used to train our model to generate audio tracks that capture the semantics of the input video. Our proposed SyncroNet builds on top of ControlNet to provide control signals that enable temporal synchronization to the synthesized audio. Extensive evaluations show that our model outperforms existing work in audio quality, and in our newly proposed synchronization evaluation method. Furthermore, we demonstrate downstream applications of our model in improving video summarization.

What problem does this paper attempt to address?

This paper addresses how to generate synchronized, semantically meaningful audio tracks for silent first-person (egocentric) videos. Specifically, the authors focus on first-person videos of daily activities, which are usually filmed with head-mounted or body-mounted cameras and show the scene from the wearer's perspective.

### Problem Background

Existing work on synchronized audio generation has the following limitations:

1. **Domain Limitation**: Most existing methods are limited to specific domains, such as speech, music, or impact sounds, and cannot capture the wide range of audio frequencies present in daily activities.
2. **Poor Temporal Synchronization**: Many models cannot generate audio that is synchronized with the input video, especially when this requires complex temporal and spatial understanding.
3. **Limited Audio Quality**: Existing models can usually only handle low-frequency audio (less than 8 kHz), and the quality of the generated audio is low.
4. **Frame Rate Limitation**: Existing methods usually use only a small number of frames (1-4 frames) and cannot capture the dynamic changes in the video.

### Solution

To solve these problems, EgoSonics proposes a method based on Latent Diffusion Models (LDMs) that achieves high-quality, synchronized audio generation through the following steps:

1. **Data Encoding and Processing**: Encode and process the paired audio-video data to make it suitable for the generation task.
2. **SyncroNet Module**: Introduce the SyncroNet module, which is based on ControlNet. It extracts temporal information from the video through self-attention and cross-attention mechanisms and generates control signals to guide audio generation.
3. **Spatio-Temporal Alignment**: Represent the audio as a short-time Fourier transform (STFT) image and use the Stable Diffusion model to generate an audio spectrogram, so that the generated audio is aligned with the video in time and space.
4. **High-Resolution Audio Generation**: Use a video frame rate of 30 fps and an audio sampling rate of up to 20 kHz to generate high-quality audio.

### Main Contributions

1. **EgoSonics**: A method that generates synchronized and semantically meaningful audio for silent first-person videos.
2. **SyncroNet Module**: A module that extracts temporal information from the video and generates control signals to improve the synchronization of the generated audio.
3. **New Synchronization Evaluation Method**: A new method to measure the synchronization quality between the generated audio and the input video.

Through these improvements, EgoSonics significantly outperforms existing methods in terms of audio quality and synchronization, and shows potential for applications in fields such as virtual reality and assistive technologies.
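To make the STFT image representation from the solution steps concrete, here is a minimal sketch of how a waveform can be turned into a log-magnitude spectrogram and how a spectrogram column can be related to a 30 fps video frame. It is not the authors' pipeline: the function names, the `n_fft` and `hop` values, and the 16 kHz sample rate of the test tone are illustrative assumptions, and only NumPy is used.

```python
import numpy as np


def stft_magnitude(signal, n_fft=1024, hop=256):
    """Return a log-magnitude spectrogram of shape (n_fft // 2 + 1, n_frames).

    n_fft and hop are illustrative values, not settings taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the waveform into overlapping, windowed analysis frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)
    # Rows are frequency bins, columns are time steps, like an image.
    return np.log1p(np.abs(spectrum)).T


def video_frame_for_column(col, hop, sample_rate, fps=30):
    """Index of the video frame shown when spectrogram column `col` starts."""
    start_seconds = col * hop / sample_rate
    return int(start_seconds * fps)


# Hypothetical test input: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft_magnitude(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 1024
print(spec.shape, peak_bin, peak_hz)
```

Each column of `spec` covers a short slice of time, so the horizontal axis of the image can be matched against video frames. That correspondence is what makes a spectrogram a convenient target for an image diffusion model that must stay in sync with the video.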

EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos
