Synthesizing Audio from Silent Video using Sequence to Sequence Modeling

Hugo Garrido-Lestache Belinchon, Helina Mulugeta, Adam Haile
2024-04-26
Abstract: Generating audio from a video's visual context has multiple practical applications in improving how we interact with audio-visual media - for example, enhancing CCTV footage analysis, restoring historical videos (e.g., silent movies), and improving video generation models. We propose a novel method to generate audio from video using a sequence-to-sequence model, improving on prior work that used CNNs and WaveNet and faced sound diversity and generalization challenges. Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structures, decoding with a custom audio decoder for a broader range of sounds. Trained on the Youtube8M dataset segment, focusing on specific domains, our model aims to enhance applications like CCTV footage analysis, silent movie restoration, and video generation models.
Sound, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of generating audio from the visual context of a video. The authors propose a sequence-to-sequence model that aims to overcome shortcomings of previous work, chiefly limited sound diversity and weak generalization. The method targets applications such as CCTV video analysis, restoration of historical videos (for example, silent movies), and video generation models.

### Main problems:

1. **Sound diversity**: Previous methods produced a narrow range of sounds and could not cover a wide variety of audio types.
2. **Generalization ability**: Previous models generalized poorly across video domains and could usually handle only specific types of videos.
3. **Efficiency problem**: Previous models were slow at generating audio, especially when processing videos frame by frame.

### Solutions:

- **3D Vector Quantized Variational Autoencoder (VQ-VAE)**: Captures the spatial and temporal structure of a video and produces a discrete representation of it.
- **Custom audio decoder**: Generates a wider range of sound types from the video's discrete representation.
- **Sequence-to-sequence model**: Trained end to end, with the aim of improving the model's efficiency and generalization.

### Goals:

- **Enhanced applications**: Improve CCTV video analysis, restore historical videos (such as silent movies), and improve video generation models.
- **Improved efficiency**: Speed up audio generation through a more efficient model architecture.
- **Expanded scope of application**: Enable the model to handle more types of videos, not only specific ones.

Through these methods, the paper aims to provide a more efficient and more general-purpose way to generate high-quality audio from videos.
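
To make the "discrete representation" produced by a VQ-VAE more concrete, the following is a minimal NumPy sketch of the vector-quantization step only: each continuous latent vector is replaced by its nearest entry in a codebook, and the resulting indices form a token sequence that a decoder could consume. This is an illustration under stated assumptions, not the authors' implementation. The function name `quantize`, the latent shape `(T, H, W, D)`, the codebook size, and the random inputs are all hypothetical, and the encoder, audio decoder, and training losses are omitted.

```python
import numpy as np


def quantize(latents, codebook):
    """Replace each latent vector with its nearest codebook entry.

    latents:  (T, H, W, D) continuous encoder outputs (hypothetical shape:
              time, height, width, embedding dimension).
    codebook: (K, D) code vectors (learned in a real model, random here).
    Returns (indices, quantized) with shapes (T, H, W) and (T, H, W, D).
    """
    flat = latents.reshape(-1, latents.shape[-1])  # (N, D)
    # Squared Euclidean distance between every latent and every code: (N, K).
    dists = (
        (flat ** 2).sum(axis=1, keepdims=True)
        - 2.0 * flat @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = dists.argmin(axis=1)  # nearest code per latent vector
    quantized = codebook[indices].reshape(latents.shape)
    return indices.reshape(latents.shape[:-1]), quantized


# Assumed toy sizes: 16 codes of dimension 8, a 4x6x6 spatio-temporal grid.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))
latents = rng.normal(size=(4, 6, 6, 8))

indices, quantized = quantize(latents, codebook)
tokens = indices.reshape(-1)  # flattened sequence of discrete video tokens
```

Flattening the index grid into `tokens` is one simple way to obtain a sequence for a sequence-to-sequence decoder; the ordering the paper actually uses is not stated in this summary.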