EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos

Aashish Rai, Srinath Sridhar
2024-07-30
Abstract: We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets. Existing work has been limited to domains like speech, music, or impact sounds and cannot easily capture the broad range of audio frequencies found in egocentric videos. EgoSonics addresses these limitations by building on the strength of latent diffusion models for conditioned audio synthesis. We first encode and process audio and video data into a form that is suitable for generation. The encoded data is used to train our model to generate audio tracks that capture the semantics of the input video. Our proposed SyncroNet builds on top of ControlNet to provide control signals that enable temporal synchronization of the synthesized audio. Extensive evaluations show that our model outperforms existing work in audio quality, and in our newly proposed synchronization evaluation method. Furthermore, we demonstrate downstream applications of our model in improving video summarization.
Computer Vision and Pattern Recognition, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of generating synchronized, semantically meaningful audio tracks for silent first-person (egocentric) videos. Specifically, the authors focus on first-person videos of daily activities, which are usually filmed with head-mounted or body-mounted cameras and show the scene from the wearer's perspective.

### Problem Background

Existing work on generating synchronized audio has the following limitations:

1. **Domain Limitation**: Most existing methods are limited to specific domains, such as speech, music, or impact sounds, and cannot capture the wide range of audio frequencies present in daily activities.
2. **Poor Temporal Synchronization**: Many models cannot generate audio that is synchronized with the input video, especially when this requires complex temporal and spatial understanding.
3. **Limited Audio Quality**: Existing models can usually only handle low-frequency audio (below 8 kHz), and the quality of the generated audio is low.
4. **Frame Rate Limitation**: Existing methods usually use only a small number of frames (1-4 frames) and cannot capture the dynamic changes in the video.

### Solution

To address these problems, EgoSonics proposes a method based on Latent Diffusion Models (LDMs) that achieves high-quality, synchronized audio generation through the following steps:

1. **Data Encoding and Processing**: First, encode and process the paired audio-video data to make it suitable for the generation task.
2. **SyncroNet Module**: Introduce the SyncroNet module, which builds on ControlNet, extracts temporal information from the video through self-attention and cross-attention mechanisms, and generates control signals to guide audio generation.
3. **Spatio-Temporal Alignment**: Represent the audio as a short-time Fourier transform (STFT) image and use the Stable Diffusion model to generate an audio spectrogram, so that the generated audio is aligned with the video in time and space.
4. **High-Resolution Audio Generation**: Use a video frame rate of 30 fps and an audio sampling rate of up to 20 kHz to generate high-quality audio.

### Main Contributions

1. **EgoSonics**: A method that generates synchronized and semantically meaningful audio for silent first-person videos.
2. **SyncroNet Module**: A module that extracts temporal information from the video and generates control signals to improve the synchronization of the generated audio.
3. **New Synchronization Evaluation Method**: A new method to measure the synchronization quality between the generated audio and the input video.

With these improvements, EgoSonics outperforms existing methods in audio quality and synchronization, and shows potential for applications in fields such as virtual reality and assistive technologies.
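
The STFT "image" representation mentioned in the solution steps can be illustrated with a small NumPy sketch. It is not the authors' pipeline: the window length (`n_fft=512`), hop size (`hop=256`), Hann window, log scaling, and the helper name `stft_magnitude` are all assumptions chosen for illustration. Only the 20 kHz sampling rate is taken from the summary above, which gives it as an upper bound.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=256):
    """Magnitude spectrogram of shape (n_fft // 2 + 1, n_frames).

    Hypothetical helper: window and hop sizes are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Real FFT of each frame; transpose so rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 20_000  # Hz; assumed from the "up to 20 kHz" figure above
n_fft = 512
t = np.arange(sample_rate) / sample_rate      # one second of audio
tone = np.sin(2 * np.pi * 1000.0 * t)         # 1 kHz test tone

spec = stft_magnitude(tone, n_fft=n_fft, hop=256)

# Log-scale and normalize to [0, 1] so the spectrogram can be treated as an image.
image = np.log1p(spec)
image = image / image.max()

# The strongest frequency bin should sit near 1 kHz.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / n_fft
print(spec.shape, round(peak_hz, 1))
```

Each column of `image` covers a short slice of time and each row a frequency band, which is what lets an image diffusion model operate on audio. With these assumed settings, one second of audio yields 77 columns, so each 30 fps video frame corresponds to roughly two to three spectrogram columns.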