Abstract: Talking Head Generation (THG), typically driven by audio, is an important and challenging task with broad application prospects in fields such as digital humans, film production, and virtual reality. While diffusion model-based THG methods achieve high-quality and stable content generation, they often overlook the intrinsic style of a video, which encompasses personalized features such as speaking habits and facial expressions. As a consequence, the generated video content lacks diversity and vividness, which limits its use in real-life scenarios. To address these issues, we propose a novel framework named Style-Enhanced Vivid Portrait (SVP), which fully leverages style-related information in THG. Specifically, we first introduce a novel probabilistic style prior learning scheme that models the intrinsic style as a Gaussian distribution using facial expressions and audio embeddings. The distribution is learned through a bespoke contrastive objective, effectively capturing the dynamic style information in each video. We then fine-tune a pretrained Stable Diffusion (SD) model to inject the learned intrinsic style as a controlling signal via cross-attention. Experiments show that our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.

What problem does this paper attempt to address?

The paper addresses a weakness of existing diffusion-based Talking Head Generation (THG) methods: the videos they generate are monotonous and lack diversity and vividness, particularly in capturing and expressing personalized features such as speaking habits and facial expressions. This limits how useful the generated videos are in real-world scenarios.

Specifically, the paper points out that although current methods can generate high-quality and stable video content, they often overlook the intrinsic style, that is, the attributes unique to the individual in the video, such as speaking habits and emotional expressions. These personalized details are crucial for generating realistic talking head videos, but they are difficult to infer from common conditioning signals (such as facial keypoints).

To solve these problems, the authors propose a new framework named Style-Enhanced Vivid Portrait (SVP), which makes full use of style-related information to improve talking head generation. The specific improvements are:

1. **Introducing Probabilistic Style Prior Learning**: Extract style features from facial expressions and audio in a self-supervised manner and model them as a Gaussian distribution, thereby capturing the dynamic style information in each video.
2. **Improving the Control Signal of the Diffusion Model**: Inject the learned intrinsic style into the pretrained Stable Diffusion model through a cross-attention mechanism to control the generation process.
3. **Enhancing Style Transfer Ability**: By learning from different emotions and expressions, the model generates videos that are not only high in quality but can also flexibly show a variety of emotional changes.
According to the paper, these innovations let SVP improve the diversity and vividness of the generated videos while maintaining high fidelity, outperforming existing state-of-the-art methods.

### Formula Summary

- **Noise Prediction Loss**:
  \[ L_{\text{denoising}}=\mathbb{E}_{z_t, \epsilon, c, t}\left[\left\|\epsilon_\theta(z_t, c, t)-\epsilon_t\right\|^2\right] \]
  where \(z_t = \sqrt{\alpha_t}z_0+\sqrt{1 - \alpha_t}\epsilon_t\), \(\epsilon_t\) is the added noise, and \(\epsilon_\theta\) is the noise predicted by the UNet model.
- **Gaussian Distribution Parameter Calculation**:
  \[ \mu_s=\text{softmax}(W_s\hat{s})\cdot\hat{s}^T \]
  \[ \sigma_s=\text{softmax}(W_s\hat{s})\cdot(\hat{s}^T - \mu_s)^2 \]
  \[ s=\mu_s+\sigma_s\cdot\epsilon,\quad\epsilon\sim\mathcal{N}(0, I) \]
- **Contrastive Loss**:
  \[ L_{\text{con}}=-\log\left(\frac{\omega(s_p)}{\omega(s_p)+\sum_{s_n\in S_n}\omega(s_n)}\right) \]
  where \(\omega(\tilde{s})=\exp\left(\frac{\zeta(s, \tilde{s})}{\tau}\right)\) and \(\zeta(s_i, s_j)=\frac{1}{\|s_i - s_j\|_2 + 1}\). Here \(s\) is the anchor style sample, \(s_p\) is a positive sample, \(S_n\) is the set of negative samples, and \(\tau\) is a temperature.

Together, these objectives train the model to learn the intrinsic style distribution and to use it as a conditioning signal during denoising.
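The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the treatment of \(\hat{s}\) as a list of per-frame feature vectors, the treatment of \(W_s\) as a single weight vector that scores each frame, and the default temperature `tau=0.1` are all assumptions made for the example. The sampling step follows the summary literally and scales the noise by \(\sigma_s\) as written.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def style_distribution(s_hat, w_s):
    """Attention-pooled mean and spread of per-frame style features.

    s_hat: list of T feature vectors of length D (assumed layout).
    w_s:   length-D scoring weights (hypothetical stand-in for W_s).
    """
    scores = [sum(w * x for w, x in zip(w_s, frame)) for frame in s_hat]
    attn = softmax(scores)  # one weight per frame
    dim = len(s_hat[0])
    mu = [sum(a * f[d] for a, f in zip(attn, s_hat)) for d in range(dim)]
    sigma = [sum(a * (f[d] - mu[d]) ** 2 for a, f in zip(attn, s_hat))
             for d in range(dim)]
    return mu, sigma


def sample_style(mu, sigma, rng):
    """Reparameterized sample s = mu + sigma * eps, eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def similarity(s_i, s_j):
    """zeta(s_i, s_j) = 1 / (||s_i - s_j||_2 + 1), a value in (0, 1]."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(s_i, s_j)))
    return 1.0 / (dist + 1.0)


def contrastive_loss(s, s_pos, s_negs, tau=0.1):
    """L_con for one anchor, one positive, and a list of negatives."""
    w_pos = math.exp(similarity(s, s_pos) / tau)
    w_neg = sum(math.exp(similarity(s, n) / tau) for n in s_negs)
    return -math.log(w_pos / (w_pos + w_neg))


def add_noise(z0, alpha_t, eps):
    """Forward step z_t = sqrt(alpha_t) * z0 + sqrt(1 - alpha_t) * eps."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def denoising_loss(eps_pred, eps_true):
    """Squared error for one sample; the loss averages this over samples."""
    return sum((p - t) ** 2 for p, t in zip(eps_pred, eps_true))
```

A typical use would compute `mu, sigma = style_distribution(frames, w_s)`, draw `s = sample_style(mu, sigma, random.Random(0))`, and then score `s` against a positive sample and a list of negatives with `contrastive_loss`. How positives and negatives are chosen is not specified in the summary above, so that choice is left to the caller here.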

SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Say Anything with Any Style

VAST: Vivify Your Talking Avatar Via Zero-Shot Expressive Facial Style Transfer

DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Audio-driven Talking Face Video Generation with Natural Head Pose

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Style Transfer for 2D Talking Head Animation

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN

Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

TalkingStyle: Personalized Speech-Driven 3D Facial Animation with Style Preservation