Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.

### What problem does this paper attempt to address?

This paper addresses several key challenges in audio-driven talking portrait video generation:

1. **Temporal consistency**: Existing diffusion-based methods struggle to generate long, coherent videos, which leads to inconsistency between frames. For example, a person's head movement or facial expression may jump or change unnaturally from one frame to the next.
2. **Sampling efficiency**: Diffusion models rely on iterative sampling, which is slow; generating a video of a few seconds may take several minutes.
3. **Emotional expression**: Generated talking portrait videos lack natural and diverse emotional expression, and it is difficult to reflect the speaker's emotional state accurately from the audio.
4. **Dependence on auxiliary information**: Many existing methods rely on additional facial priors (such as bounding boxes, 2D landmarks, skeletons, or 3D meshes), which limits the diversity and fidelity of head movements and introduces strong spatial biases.

To address these problems, the authors propose FLOAT (Generative Motion Latent Flow Matching for Audio-driven Talking Portrait). Specifically, FLOAT improves on existing methods in the following ways:

- **Flow matching generative model**: Moves generative modeling from the pixel-based latent space to a learned motion latent space, which makes it more efficient to design temporally consistent motion.
- **Transformer-based vector field predictor**: Introduces a simple and effective transformer-based vector field predictor with a frame-wise conditioning mechanism, which supports the temporal consistency of the generated motion.
- **Speech-driven emotion enhancement**: Supports enhancing emotion-related motion through speech-driven emotion labels, making the generated videos more natural and expressive.

With these improvements, the authors report that FLOAT outperforms existing audio-driven talking portrait generation methods in visual quality, motion fidelity, and efficiency.
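To make the flow matching idea above more concrete, the following is a minimal toy sketch in NumPy, not the authors' implementation. It assumes the common linear path between noise and data, a fixed-step Euler solver, and made-up latent shapes; `cfm_training_pair`, `euler_sample`, and `oracle_field` are hypothetical names introduced here for illustration. The oracle field stands in for a trained transformer predictor and simply uses the per-frame condition as the target latent, whereas a real system would condition on audio features.

```python
import numpy as np


def cfm_training_pair(x1, rng):
    """Return (x_t, t, target) for one flow matching regression step.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1 between noise x0
    and a data sample x1; the regression target is then x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample, same shape as the latents
    t = rng.uniform()                    # one time value in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, t, x1 - x0


def euler_sample(vector_field, cond, num_steps, rng):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    # Toy simplification: the condition has the same shape as the latents,
    # one row per frame, so it also fixes the shape of the starting noise.
    x = rng.standard_normal(cond.shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * vector_field(x, i * dt, cond)
    return x


def oracle_field(x, t, cond):
    """Toy stand-in for a trained predictor (hypothetical).

    Treats cond as the per-frame target latent and returns the conditional
    velocity of the linear path, (x1 - x) / (1 - t).
    """
    return (cond - x) / (1.0 - t)


rng = np.random.default_rng(0)
frames, dim = 8, 4
target_latents = rng.standard_normal((frames, dim))  # made-up "motion latents"
sample = euler_sample(oracle_field, target_latents, num_steps=10, rng=rng)
```

In this toy setting the final Euler step lands on the target latents because the oracle field knows them exactly. A learned predictor only approximates this field, so the number of solver steps becomes a trade-off between speed and accuracy.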

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

