VarietySound: Timbre-Controllable Video to Sound Generation Via Unsupervised Information Disentanglement

Chenye Cui, Yi Ren, Jinglin Liu, Rongjie Huang, Zhou Zhao
DOI: https://doi.org/10.1109/icassp49357.2023.10096353
2023-01-01
ICASSP
Abstract: Video-to-sound generation aims to generate realistic and natural sound given a video input. However, previous video-to-sound generation methods can only generate a random or average timbre and offer no control over the timbre of the generated sound, so users sometimes cannot obtain the timbre they want. In this paper, we propose the task of generating sound with a specific timbre given a silent video input and a reference audio sample. To solve this task, we first use three encoders to disentangle each target audio sample into temporal, acoustic, and background information, respectively; we then use a decoder to reconstruct the audio from these disentangled representations. To improve the quality and temporal alignment of the generated results, we also adopt a mel discriminator and a temporal discriminator for adversarial training. Experimental results on the VAS dataset demonstrate that our method generates high-quality audio samples that are well synchronized with events in the video and have high timbre similarity to the reference audio. Our demos are available at https://conferencedemos.github.io/icassp23/.
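The encoder and decoder data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the random linear projections stand in for learned networks, the sizes `N_MELS` and `HIDDEN` are hypothetical, and pooling the acoustic and background codes over time is an assumption made only to keep the example small. The two discriminators and the video branch are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MELS, HIDDEN = 80, 16  # hypothetical sizes, not taken from the paper

# Random linear maps used as placeholders for the learned encoders and decoder.
W_temporal = rng.standard_normal((N_MELS, HIDDEN))
W_acoustic = rng.standard_normal((N_MELS, HIDDEN))
W_background = rng.standard_normal((N_MELS, HIDDEN))
W_decoder = rng.standard_normal((3 * HIDDEN, N_MELS))


def temporal_encoder(mel):
    # Frame-level code, one vector per time step: shape (T, HIDDEN).
    return mel @ W_temporal


def acoustic_encoder(mel):
    # Clip-level timbre code, averaged over time (assumption): shape (HIDDEN,).
    return (mel @ W_acoustic).mean(axis=0)


def background_encoder(mel):
    # Clip-level background code, averaged over time (assumption): shape (HIDDEN,).
    return (mel @ W_background).mean(axis=0)


def decoder(temporal, acoustic, background):
    # Broadcast the clip-level codes to every frame, then project back to mel bins.
    n_frames = temporal.shape[0]
    cond = np.concatenate(
        [
            temporal,
            np.tile(acoustic, (n_frames, 1)),
            np.tile(background, (n_frames, 1)),
        ],
        axis=1,
    )
    return cond @ W_decoder  # shape (T, N_MELS)


def generate(temporal_src, acoustic_src, background_src):
    # Each factor may come from a different mel-spectrogram.
    return decoder(
        temporal_encoder(temporal_src),
        acoustic_encoder(acoustic_src),
        background_encoder(background_src),
    )


target = rng.standard_normal((100, N_MELS))    # placeholder target mel, 100 frames
reference = rng.standard_normal((40, N_MELS))  # placeholder reference mel, 40 frames

reconstruction = generate(target, target, target)   # all three codes from the target
timbre_swapped = generate(target, reference, target)  # acoustic code from the reference
```

Passing the same spectrogram for all three factors corresponds to the reconstruction setting in the abstract, while taking the acoustic code from a separate reference clip illustrates how timbre could be controlled independently of timing. The output keeps the frame count of the temporal source regardless of the reference length.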