FOLEY-VAE: Generación de efectos de audio para cine con inteligencia artificial [FOLEY-VAE: Generating Audio Effects for Film with Artificial Intelligence]

Mateo Cámara, José Luis Blanco

2023-10-24

Abstract: In this research, we present an interface based on Variational Autoencoders trained on a wide range of natural sounds for the creation of Foley effects. The model can transfer new sound features to prerecorded audio or microphone-captured speech in real time. It also allows interactive modification of latent variables, enabling precise, customized artistic adjustments. Taking as a starting point our previous study on Variational Autoencoders, presented at this same congress last year, we analyzed an existing implementation: RAVE [1]. This model has been trained specifically for audio effects production. A variety of audio effects have been generated successfully, including electromagnetic, science fiction, and water sounds, among others published with this work. This approach has been the basis for the artistic creation of the first Spanish short film with sound effects assisted by artificial intelligence. This milestone clearly illustrates the transformative potential of this technology in the film industry, opening the door to new possibilities for sound creation and for improving the artistic quality of film productions.

Audio and Speech Processing, Sound

What problem does this paper attempt to address?

The paper primarily explores an innovative system based on Variational Autoencoders (VAE), named Foley-VAE, for generating sound effects in films. This system aims to address the limitations of traditional sound effect generation methods, such as the challenges of physical simulation techniques in generating complex sounds, high computational costs, and difficulty in capturing rich timbral details. Specifically, the study attempts to solve the following key issues:

1. **Innovative Sound Effect Generation**: Creating new and creative Foley sound effects by using a variational autoencoder trained on a broad dataset of natural sounds.
2. **Real-time Audio Processing**: The model is capable of real-time processing, transferring new features to pre-recorded audio or sounds captured by a microphone.
3. **Interactive Adjustment**: Allowing users to interactively modify latent variables to achieve precise and personalized artistic adjustments.
4. **Enhancing Film Sound Quality**: This method was used as the basis for creating the first Spanish short film with AI-assisted sound effects, demonstrating the transformative potential of this technology in the film industry and opening up new possibilities for sound effect creation and artistic quality enhancement in film production.

In short, the goal of this paper is to demonstrate a new method for generating high-quality, innovative, and easy-to-operate film sound effects using artificial intelligence. This approach can greatly enrich the soundscape of films, providing filmmakers with more powerful and flexible tools to realize their creative visions.
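The interactive adjustment described above amounts to an encode, edit, decode loop over latent variables. The following is a minimal toy sketch of that loop only. It is not RAVE and not the authors' model: a fixed orthonormal cosine basis stands in for trained encoder and decoder weights, and every name here (`make_basis`, `encode`, `decode`, `edit_latent`) is hypothetical.

```python
import math

def make_basis(n_latent, frame_len):
    """Orthonormal cosine (DCT-II) basis; an assumed stand-in for learned weights."""
    basis = []
    for k in range(n_latent):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / frame_len)
        basis.append([scale * math.cos(math.pi * (i + 0.5) * k / frame_len)
                      for i in range(frame_len)])
    return basis

def encode(frame, basis):
    """Toy encoder: project one audio frame onto each latent direction."""
    return [sum(b * x for b, x in zip(row, frame)) for row in basis]

def decode(latent, basis):
    """Toy decoder: rebuild a frame as a weighted sum of latent directions."""
    frame_len = len(basis[0])
    return [sum(z * row[i] for z, row in zip(latent, basis))
            for i in range(frame_len)]

def edit_latent(latent, scales, offsets):
    """User-controlled adjustment: scale and shift each latent variable."""
    return [z * s + o for z, s, o in zip(latent, scales, offsets)]

basis = make_basis(n_latent=4, frame_len=64)

# A synthetic source frame built from two latent components.
source = decode([1.0, 0.5, 0.0, 0.0], basis)

# Encode, mute the second latent, push the third, then decode again.
z = encode(source, basis)
z_edited = edit_latent(z, scales=[1.0, 0.0, 1.0, 1.0],
                       offsets=[0.0, 0.0, 0.8, 0.0])
output = decode(z_edited, basis)
```

In a trained model the encoder and decoder are nonlinear networks and the latent dimensions have no guaranteed meaning, so the sliders exposed to the user would be found by listening rather than by construction as they are here.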

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent Videos With Deep Learning

FoleyGen: Visually-Guided Audio Generation

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

AudioVSR: Enhancing Video Speech Recognition with Audio Data

FastFoley: Non-autoregressive Foley Sound Generation Based on Visual Semantics

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

Del Visual al Auditivo: Sonorización de Escenas Guiada por Imagen

Video-Guided Foley Sound Generation with Multimodal Controls

Exploring Domain-Specific Enhancements for a Neural Foley Synthesizer

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos

Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras

Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

Decoding Vocal Articulations from Acoustic Latent Representations

Conditional Generation of Audio from Video via Foley Analogies

Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis

RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement

Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models