V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

Jeongsoo Choi, Ji-Hoon Kim, Jinyu Li, Joon Son Chung, Shujie Liu
2024-11-29
Abstract: In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground truth utterances.
Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of generating natural and intelligible speech from silent talking-face videos (Video-to-Speech, V2S). Existing V2S systems perform well on constrained datasets with limited speakers and vocabularies, but their performance degrades on real-world, unconstrained datasets, mainly because of the inherent complexity and variability of speech signals. Two difficulties stand out:

1. **Complexity of speech signals**: Speech contains multiple acoustic components, such as linguistic content and speaker characteristics, and these components interact in complex ways.
2. **Ambiguity between lip movements and speech**: Lip movements do not uniquely determine the corresponding speech, which makes accurate generation difficult.

To address these difficulties, the authors propose a new framework named V2SFlow, which improves on existing methods in two ways:

- **Speech decomposition**: The speech signal is decomposed into three manageable subspaces: content, pitch, and speaker information. Each subspace represents a distinct speech attribute and is predicted directly from the visual input, which simplifies the prediction task.
- **Rectified Flow Matching decoder**: A Rectified Flow Matching (RFM) decoder built on a Transformer architecture models an efficient probability path from random noise to the target speech distribution, generating coherent and realistic speech from the predicted attributes.

With these changes, V2SFlow is reported to significantly outperform existing state-of-the-art methods and even to surpass ground-truth utterances in naturalness. The reported experiments show strong results across multiple evaluation metrics, particularly naturalness, intelligibility, and similarity.
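The rectified flow idea behind the decoder can be illustrated with a small sketch. It assumes the standard rectified flow formulation: a straight path $x_t = (1 - t)\,x_0 + t\,x_1$ from noise $x_0$ to data $x_1$, a regression target equal to the constant velocity $x_1 - x_0$, and Euler integration at sampling time. This is not the authors' implementation. The function names, the array shapes, the `cond` argument (standing in for the predicted content, pitch, and speaker attributes), and the `oracle` velocity field that replaces the Transformer are all hypothetical.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return x1 - x0


def flow_matching_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between the predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - target_velocity(x0, x1)) ** 2))


def euler_sample(velocity_fn, x0, cond, num_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x


rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 80))   # stand-in for a target speech feature (shape is arbitrary)
x0 = rng.normal(size=x1.shape)  # Gaussian noise sample
cond = None                     # placeholder for content, pitch, and speaker conditioning


def oracle(x, t, cond):
    """Hypothetical perfect velocity field; a trained network would go here."""
    return x1 - x0


loss = flow_matching_loss(oracle, x0, x1, t=0.3, cond=cond)
sample = euler_sample(oracle, x0, cond, num_steps=4)
```

Because the oracle path is perfectly straight, a handful of Euler steps already carries the noise sample to the target in this toy case. A learned velocity field is only approximately straight, so the number of steps needed in practice is an empirical question that this sketch does not answer.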
In summary, this paper aims to improve the generation of high-quality speech from silent videos by combining speech decomposition with a rectified flow matching decoder, with a focus on real-world, unconstrained scenarios.