Abstract: We introduce a novel approach for high-resolution talking head generation from a single image and audio input. Prior methods using explicit face models, like 3D morphable models (3DMM) and facial landmarks, often fall short in generating high-fidelity videos due to their lack of appearance-aware motion representation. While generative approaches such as video diffusion models achieve high video quality, their slow processing speeds limit practical application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM), employs implicit motion to encode human faces into appearance-aware compressed facial latents, enhancing video generation. Although implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, we introduce motion statistics to help capture fine-grained motion information. Additionally, our model provides motion controllability to optimize the trade-off between motion intensity and visual quality during inference. IF-MDM supports real-time generation of 512x512 resolution videos at up to 45 frames per second (fps). Extensive evaluations demonstrate its superior performance over existing diffusion and explicit face models. The code will be released publicly, available alongside supplementary materials. The video results can be found at https://bit.ly/ifmdm_supplementary.

What problem does this paper attempt to address?

This paper addresses the problem of generating high-fidelity talking-head videos in real time. Specifically, the authors aim to overcome two limitations of existing methods:

1. **Explicit face models (such as 3D morphable models and facial landmarks)**: These models struggle to generate high-fidelity videos because their motion representations are not appearance-aware.
2. **Diffusion-model-based methods**: These methods can generate high-quality videos, but their slow processing speed limits practical application.

To this end, the authors propose a new method named **Implicit Face Motion Diffusion Model (IF-MDM)**. The model encodes human faces through implicit motion, compressing them into facial latent representations that carry appearance information, which improves video generation.

The main contributions of IF-MDM include:

- Proposing a framework that uses highly compressed, appearance-aware implicit motion representations for video generation.
- Introducing a controllable talking-head generation method that allows for flexible and realistic motion representations.
- Demonstrating the advantages and limitations of the method through quantitative and qualitative comparisons with existing methods, providing insights for future research.

### Core Innovations of IF-MDM

1. **Implicit motion representation**: Unlike explicit face models, IF-MDM uses an implicit motion representation, which avoids common artifacts (such as inconsistencies in the torso or background) and enables real-time generation (up to 45 frames per second at a resolution of 512x512).
2. **Motion statistics**: To capture fine-grained motion information, the motion mean \( \mu_m \) and standard deviation \( \sigma_m \) are introduced, which help the model better align audio and motion.
3. **Motion controllability**: By adjusting the motion mean and standard deviation, the trade-off between motion intensity and visual quality can be tuned at inference time.

### Model Architecture

The training of IF-MDM is divided into two stages:

1. **Learning disentangled motion and appearance representations**: A self-supervised learning framework extracts appearance information and produces compact motion representations.
2. **Generating natural talking-head motions**: An implicit motion generator is trained to generate motion sequences from the input speech; these are combined with the appearance information to produce the final talking-head video.

### Experimental Results

Experiments show that IF-MDM outperforms existing explicit face models and video diffusion models in several respects, especially generation speed, image quality, and temporal consistency. Although it is slightly inferior to explicit face models in lip-sync quality, it has clear advantages in overall visual quality and real-time performance.

In conclusion, by introducing an implicit motion representation and motion statistics, IF-MDM addresses the shortcomings of existing methods in generating high-fidelity talking-head videos in real time.
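
To make the role of the motion mean \( \mu_m \) and standard deviation \( \sigma_m \) more concrete, the following is a minimal sketch of how such statistics might be computed over a sequence of motion latents, and how scaling the deviation around the mean changes motion intensity. It is an illustration only: the summary does not say how IF-MDM computes these statistics or feeds them to the generator, so the function names (`motion_statistics`, `rescale_motion`), the list-of-lists latent format, and the post-hoc rescaling itself are all assumptions, not the authors' conditioning mechanism.

```python
from statistics import fmean, pstdev


def motion_statistics(latents):
    """Return per-dimension mean and standard deviation over time.

    `latents` is a list of frames, each a list of floats. This is a
    hypothetical stand-in for a compressed motion latent sequence.
    """
    dims = list(zip(*latents))  # transpose: one tuple of values per dimension
    mu = [fmean(d) for d in dims]
    sigma = [pstdev(d) for d in dims]
    return mu, sigma


def rescale_motion(latents, intensity=1.0):
    """Scale each frame's deviation from the temporal mean by `intensity`.

    The temporal mean is preserved and the standard deviation is
    multiplied by `intensity`. This is an assumed post-hoc operation
    used for illustration, not the model's own control path.
    """
    mu, _ = motion_statistics(latents)
    return [
        [m + intensity * (x - m) for x, m in zip(frame, mu)]
        for frame in latents
    ]


if __name__ == "__main__":
    # Three frames of a toy 2-dimensional motion latent.
    sequence = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
    print(motion_statistics(sequence))
    # Halve the motion intensity; the mean stays the same.
    print(rescale_motion(sequence, intensity=0.5))
```

In this toy setup, an `intensity` below 1 damps motion around the mean pose and a value above 1 exaggerates it, which mirrors the trade-off between motion intensity and visual quality described above.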

IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation

MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Audio-driven Talking Face Video Generation with Natural Head Pose

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model

DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion

DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing

DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis

DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

High-Fidelity and Freely Controllable Talking Head Video Generation

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models