Weipeng Tan, Chuming Lin, Chengming Xu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu
Abstract: Talking Head Generation (THG), typically driven by audio, is an important and challenging task with broad application prospects in various fields such as digital humans, film production, and virtual reality. While diffusion model-based THG methods present high quality and stable content generation, they often overlook the intrinsic style, which encompasses personalized features such as speaking habits and facial expressions of a video. As a consequence, the generated video content lacks diversity and vividness, and is thus limited in real-life scenarios. To address these issues, we propose a novel framework named Style-Enhanced Vivid Portrait (SVP) which fully leverages style-related information in THG. Specifically, we first introduce a novel probabilistic style prior learning to model the intrinsic style as a Gaussian distribution using facial expressions and audio embedding. The distribution is learned through a bespoke contrastive objective, effectively capturing the dynamic style information in each video. Then we finetune a pretrained Stable Diffusion (SD) model to inject the learned intrinsic style as a controlling signal via cross attention. Experiments show that our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
What problem does this paper attempt to address?
The paper addresses the problem that videos produced by existing diffusion-based Talking Head Generation (THG) methods are monotonous and lack diversity and vividness, particularly in capturing and expressing personalized features such as speaking habits and facial expressions. This limits how useful the generated videos are in real-world scenarios.
Specifically, the paper points out that although current methods can generate high-quality and stable video content, they often overlook the intrinsic style, that is, the attributes unique to the individual in a video, such as speaking habits and emotional expressions. These personalized details are crucial for generating realistic talking head videos, but they are difficult to infer from common conditioning signals such as facial keypoints.
To solve these problems, the authors propose a new framework named Style-Enhanced Vivid Portrait (SVP), which makes full use of style-related information to improve talking head generation. Specific improvements include:
1. **Introducing Probabilistic Style Prior Learning**: Style features are extracted from facial expressions and audio embeddings in a self-supervised manner and modeled as a Gaussian distribution, thereby capturing the dynamic style information in each video.
2. **Improving the Control Signal of the Diffusion Model**: The learned intrinsic style is injected into a pre-trained Stable Diffusion model through a cross-attention mechanism, giving effective control over the generation process.
3. **Enhancing Style Transfer Ability**: By learning from different emotions and expressions, the model generates videos that are not only high quality but can also flexibly show a variety of emotional changes.
Through these innovations, the SVP framework improves the diversity and vividness of the generated videos while maintaining high fidelity, and is reported to outperform existing state-of-the-art methods.
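The cross-attention injection in point 2 can be illustrated with a minimal NumPy sketch. This is not the SVP implementation: the single-head layout, the residual connection, the projection matrices `w_q`, `w_k`, `w_v`, and all tensor sizes are assumptions chosen for illustration. It only shows the general mechanism by which latent tokens (queries) can attend to conditioning tokens such as a style embedding (keys and values).

```python
import numpy as np

def cross_attention(latent_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention with a residual connection (illustrative).

    latent_tokens:  (N, D_lat) tokens of the denoising network
    context_tokens: (M, D_ctx) conditioning tokens, e.g. a style embedding
    """
    q = latent_tokens @ w_q                     # (N, d)
    k = context_tokens @ w_k                    # (M, d)
    v = context_tokens @ w_v                    # (M, D_lat)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (N, M) scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)        # softmax over context
    return latent_tokens + attn @ v, attn

rng = np.random.default_rng(0)
latent = rng.standard_normal((6, 16))   # hypothetical: 6 latent tokens, width 16
context = rng.standard_normal((3, 8))   # hypothetical: 3 conditioning tokens, width 8
w_q = rng.standard_normal((16, 4))
w_k = rng.standard_normal((8, 4))
w_v = rng.standard_normal((8, 16))

out, attn = cross_attention(latent, context, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over the conditioning tokens, so changing the style embedding changes what every latent token receives.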
### Formula Summary
- **Noise Prediction Loss**:
\[
L_{\text{denoising}}=\mathbb{E}_{z_t, \epsilon, c, t}\left[\left\|\epsilon_\theta(z_t, c, t)-\epsilon_t\right\|^2\right]
\]
where \(z_t = \sqrt{\alpha_t}z_0+\sqrt{1 - \alpha_t}\epsilon_t\), \(\epsilon_t\) is the added noise, and \(\epsilon_\theta\) is the noise predicted by the UNet model.
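As a rough numerical illustration of this loss, the sketch below builds \(z_t\) from \(z_0\) and the noise, then compares a predicted noise against the true noise. The toy array shapes, the value of `alpha_t`, and the stand-in "predictors" are assumptions; in the paper the prediction comes from the UNet, which is not modeled here.

```python
import numpy as np

def add_noise(z0, eps, alpha_t):
    """Forward step: z_t = sqrt(alpha_t) * z_0 + sqrt(1 - alpha_t) * eps."""
    return np.sqrt(alpha_t) * z0 + np.sqrt(1.0 - alpha_t) * eps

def denoising_loss(eps_pred, eps):
    """Squared L2 norm of the noise residual per sample, averaged over the batch."""
    diff = (eps_pred - eps).reshape(eps.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8))    # toy latents, batch of 4
eps = rng.standard_normal((4, 8))   # noise added at this step
alpha_t = 0.7                       # hypothetical noise-schedule value
z_t = add_noise(z0, eps, alpha_t)

# A predictor that recovers eps exactly incurs zero loss.
perfect_loss = denoising_loss(eps, eps)
# A predictor that always outputs zeros pays the squared norm of eps.
zero_loss = denoising_loss(np.zeros_like(eps), eps)
```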
- **Gaussian Distribution Parameter Calculation**:
\[
\mu_s=\text{softmax}(W_s\hat{s})\cdot\hat{s}^T
\]
\[
\sigma_s=\text{softmax}(W_s\hat{s})\cdot(\hat{s}^T - \mu_s)^2
\]
\[
s=\mu_s+\sigma_s\cdot\epsilon,\quad\epsilon\sim\mathcal{N}(0, I)
\]
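One way to read these three equations is as attention-weighted pooling followed by the reparameterization trick. The sketch below assumes \(\hat{s}\) is a sequence of `T` frame-level style features of width `D` and that \(W_s\) maps each frame to a scalar score; these shapes are an interpretation, not something the summary states.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def style_distribution(s_hat, w_s):
    """Attention-pooled mean and spread of frame-level style features.

    s_hat: (T, D) frame-level features (assumed layout)
    w_s:   (D,)   scoring vector standing in for W_s (assumed shape)
    """
    weights = softmax(s_hat @ w_s)          # (T,) one weight per frame
    mu = weights @ s_hat                    # (D,) weighted mean
    sigma = weights @ (s_hat - mu) ** 2     # (D,) weighted squared deviation
    return mu, sigma

def sample_style(mu, sigma, rng):
    """Reparameterized sample: s = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
s_hat = rng.standard_normal((10, 5))   # hypothetical: 10 frames, 5-dim features
w_s = rng.standard_normal(5)
mu, sigma = style_distribution(s_hat, w_s)
s = sample_style(mu, sigma, rng)
```

Because the randomness enters only through `eps`, gradients can flow to `mu` and `sigma` during training, which is the usual reason for sampling this way.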
- **Contrastive Loss**:
\[
L_{\text{con}}=-\log\left(\frac{\omega(s_p)}{\omega(s_p)+\sum_{s_n\in S_n}\omega(s_n)}\right)
\]
where \(\omega(\tilde{s})=\exp\left(\frac{\zeta(s, \tilde{s})}{\tau}\right)\) and \(\zeta(s_i, s_j)=\frac{1}{\|s_i - s_j\|_2 + 1}\).
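The contrastive loss can be transcribed almost directly. In the sketch below, the temperature `tau=0.1` and the toy vectors are assumptions; how positive and negative style samples are chosen is not described here and is left to the caller.

```python
import numpy as np

def similarity(s_i, s_j):
    """zeta(s_i, s_j) = 1 / (||s_i - s_j||_2 + 1); equals 1 for identical vectors."""
    return 1.0 / (np.linalg.norm(s_i - s_j) + 1.0)

def contrastive_loss(s, s_pos, s_negs, tau=0.1):
    """-log( w(s_pos) / (w(s_pos) + sum_n w(s_n)) ), with w = exp(zeta / tau)."""
    w_pos = np.exp(similarity(s, s_pos) / tau)
    w_neg = sum(np.exp(similarity(s, s_n) / tau) for s_n in s_negs)
    return float(-np.log(w_pos / (w_pos + w_neg)))

anchor = np.array([0.0, 0.0])
near = np.array([0.1, 0.0])
far_a = np.array([5.0, 5.0])
far_b = np.array([-4.0, 6.0])

# Positive close to the anchor, negatives far away: small loss.
good = contrastive_loss(anchor, near, [far_a, far_b])
# Positive far from the anchor, a negative close by: larger loss.
bad = contrastive_loss(anchor, far_a, [near, far_b])
```

Since \(\zeta\) lies in \((0, 1]\), the temperature \(\tau\) controls how strongly distance differences are amplified before the softmax-style normalization.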
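Together, these formulas cover the three parts of training: the noise prediction loss trains the diffusion model, the Gaussian parameters define the style distribution from which \(s\) is sampled, and the contrastive loss shapes that distribution so that the model can learn and use the intrinsic style.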