Abstract: In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground truth utterances.

What problem does this paper attempt to address?

This paper addresses the problem of generating natural and intelligible speech from silent talking face videos (Video-to-Speech, V2S). Existing V2S systems perform well on constrained datasets with limited speakers and vocabularies, but their performance degrades on real-world, unconstrained datasets, mainly because of the inherent complexity and variability of speech signals. Two difficulties stand out:

1. **Complexity of speech signals**: Speech signals contain multiple acoustic components, such as linguistic content and speaker characteristics, and these components interact in complex ways.
2. **Ambiguity between lip movements and speech**: Lip movements do not map unambiguously to the corresponding speech, which makes it difficult to generate speech accurately.

To address these problems, the authors propose a new framework named V2SFlow. It improves on existing methods in two ways:

- **Speech decomposition**: The speech signal is decomposed into three manageable subspaces: content, pitch, and speaker information. Each subspace represents a distinct speech attribute and is predicted directly from the visual input, which simplifies the prediction task.
- **Rectified Flow Matching decoder**: A Rectified Flow Matching (RFM) decoder built on a Transformer architecture models an efficient probabilistic path from random noise to the target speech distribution, generating coherent and realistic speech from the predicted attributes.

According to the reported experiments, V2SFlow significantly outperforms state-of-the-art methods and even surpasses ground truth utterances in naturalness. It achieves strong results on multiple evaluation metrics covering naturalness, intelligibility, and similarity.
In summary, this paper aims to improve the generation of high-quality speech from silent videos by combining speech decomposition with rectified flow matching, with a particular focus on real-world scenarios.
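
To make the idea of a "path from random noise to the target speech distribution" more concrete, the following is a minimal sketch of the generic rectified flow formulation, not the V2SFlow decoder itself. It assumes the standard straight-line interpolation between a noise sample and a data sample, with a constant velocity target. The function names, the array shapes, and the oracle velocity function are hypothetical stand-ins: in the paper, a Transformer conditioned on the predicted content, pitch, and speaker attributes would play the role of the velocity model.

```python
import numpy as np


def rectified_flow_pair(x0, x1, t):
    """Point on the straight noise-to-data path and its velocity target.

    x0: noise sample, x1: data sample, t: scalar in [0, 1].
    Generic rectified flow: x_t = (1 - t) * x0 + t * x1, velocity = x1 - x0.
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 80))  # noise; shape is an arbitrary illustration
x1 = rng.standard_normal((4, 80))  # stand-in for a target speech feature

# Hypothetical oracle that returns the true straight-line velocity.
# A trained network would only approximate this from (x, t, conditions).
def oracle_velocity(x, t):
    return x1 - x0


x_mid, v_mid = rectified_flow_pair(x0, x1, 0.5)
x_gen = euler_sample(oracle_velocity, x0, num_steps=5)
```

Because the path is a straight line with constant velocity, the oracle reaches the target in very few Euler steps, which is the intuition behind describing such paths as efficient. A learned velocity model would be trained to regress `v_target` at randomly drawn `t`, and its samples would only approximate the target.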

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

TalkingFlow: Talking Facial Landmark Generation with Multi-Scale Normalizing Flow Network

Flow-Based Unconstrained Lip to Speech Generation

VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching

OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance

Improving Unsupervised Video Object Segmentation via Fake Flow Generation

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

Creating New Voices using Normalizing Flows

DVCFlow: Modeling Information Flow Towards Human-like Video Captioning

FlowSep: Language-Queried Sound Separation with Rectified Flow Matching

VoiceLens: Controllable Speaker Generation and Editing with Flow

FloWaveNet : A Generative Flow for Raw Audio

Compositional Video Generation as Flow Equalization

Video Frame Synthesis using Deep Voxel Flow

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder