FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Taekyung Ki, Dongchan Min, Gyoungsu Chae
2024-12-02
Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia, Image and Video Processing
What problem does this paper attempt to address?
This paper addresses several key challenges in audio-driven talking portrait video generation:

1. **Temporal consistency**: Existing diffusion-based methods struggle to generate long, coherent videos, which leads to inter-frame inconsistency. For example, a person's head pose or facial expression may jump or change unnaturally between frames.
2. **Sampling efficiency**: Diffusion models rely on iterative sampling, so generating a video of a few seconds may take several minutes.
3. **Emotional expression**: Generated talking portrait videos lack natural and diverse emotional expression, and it is difficult to accurately reflect the speaker's emotional state from the audio content.
4. **Dependence on auxiliary information**: Many existing methods rely on additional facial priors (such as bounding boxes, 2D landmarks, skeletons, or 3D meshes), which limits the diversity and fidelity of head movements and introduces strong spatial biases.

To solve these problems, the authors propose FLOAT (Generative Motion Latent Flow Matching for Audio-driven Talking Portrait). Specifically, FLOAT improves on existing methods in the following ways:

- **Flow-matching generative model**: Shifts generative modeling from the pixel-based latent space to a learned motion latent space, which makes it more efficient to model temporally consistent motion.
- **Transformer-based vector field predictor**: Introduces a simple yet effective transformer-based vector field predictor with a frame-wise conditioning mechanism, which helps keep the generated motion temporally consistent.
- **Speech-driven emotion enhancement**: Supports enhancing emotion-related motion through speech-driven emotion labels, making the generated videos more natural and expressive.

With these improvements, the authors report that FLOAT outperforms existing audio-driven talking portrait generation methods in terms of visual quality, motion fidelity, and efficiency.
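To make the flow-matching idea summarized above more concrete, here is a minimal NumPy sketch of the two generic ingredients: building a training target along a straight-line path from noise to data, and sampling by integrating a vector field with a few Euler steps. This is not the authors' implementation. The straight-line path, the function names (`cfm_training_triple`, `euler_sample`, `make_toy_vector_field`), the feature sizes, and the linear map `W` are all assumptions chosen for illustration. In FLOAT the vector field is predicted by a transformer; here a closed-form toy field stands in for it so the example can be run and checked.

```python
import numpy as np

def cfm_training_triple(x1, rng):
    """Build (x_t, t, target) for flow matching on a straight-line path.

    x1: data latents with shape (batch, frames, dim).
    Assumes the common linear path x_t = (1 - t) * x0 + t * x1.
    """
    x0 = rng.standard_normal(x1.shape)                    # noise sample
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1, 1))   # one time per sequence
    x_t = (1.0 - t) * x0 + t * x1                         # point on the path
    target_velocity = x1 - x0                             # regression target
    return x_t, t, target_velocity

def euler_sample(vector_field, cond, latent_dim, num_steps, rng):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed-step Euler.

    cond: per-frame conditioning with shape (frames, cond_dim); each frame's
    latent is conditioned on the same frame's features.
    """
    x = rng.standard_normal((cond.shape[0], latent_dim))  # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * vector_field(x, t, cond)
    return x

def make_toy_vector_field(W):
    """Hypothetical stand-in for a learned predictor.

    It pushes each frame toward cond @ W along a straight line, which is the
    exact velocity for the linear path when the target is deterministic.
    """
    def field(x, t, cond):
        return (cond @ W - x) / (1.0 - t)
    return field

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((25, 8))   # 25 frames of hypothetical audio features
W = rng.standard_normal((8, 16))             # hypothetical audio-to-motion map
motion = euler_sample(make_toy_vector_field(W), audio_feats,
                      latent_dim=16, num_steps=10, rng=rng)
print(motion.shape)  # (25, 16)
```

The sampler needs only `num_steps` evaluations of the vector field, which is the source of the efficiency gain over samplers that require many iterative denoising steps. With a learned predictor the result depends on how well the field was trained, so the number of steps becomes a quality and speed trade-off.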