Abstract: Recent research has demonstrated impressive results in video-to-speech synthesis, which involves reconstructing speech solely from visual input. However, previous works have struggled to synthesize speech accurately because the model lacks sufficient guidance to infer the correct content with the appropriate sound. To resolve this issue, they have adopted an extra speaker embedding, derived from reference audio, as speaking-style guidance. Nevertheless, it is not always possible to obtain the audio corresponding to the input video, especially at inference time. In this paper, we present a novel vision-guided speaker embedding extractor that uses a self-supervised pre-trained model and a prompt tuning technique. With it, rich speaker embedding information can be produced solely from the input visual information, and no extra audio information is needed at inference time. Using the extracted vision-guided speaker embedding representations, we further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and on the visual representation extracted from the input video. The proposed DiffV2S not only maintains the phoneme details contained in the input video frames, but also creates a highly intelligible mel-spectrogram in which the speaker identities of multiple speakers are all preserved. Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

FastFoley: Non-autoregressive Foley Sound Generation Based on Visual Semantics

Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

Video-to-Audio Generation with Fine-grained Temporal Semantics

Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis

LoVA: Long-form Video-to-Audio Generation

FoleyGen: Visually-Guided Audio Generation

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

Diffusion Models as Masked Audio-Video Learners

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Text-Driven Foley Sound Generation With Latent Diffusion Model

DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding