Abstract: Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person's talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads" addresses the inability of existing one-shot methods to capture personalized speaking characteristics, which prevents them from producing diverse speaking styles in the final video. Specifically, the paper proposes a new one-shot, style-controllable talking face generation method that extracts the speaking style from a reference speaking video and drives the one-shot portrait to speak in that style with another piece of audio. It does so by synthesizing the style-controllable coefficients of a 3D Morphable Model (3DMM), which cover facial expressions and head movements.

### Main contributions

1. **Style encoder**: A general-purpose style encoder models the motion patterns of facial expressions and head poses from any reference style video. It uses a Transformer encoder to learn the spatio-temporal co-activation patterns of the input parameter sequence and embeds them into a style code through a self-attention pooling layer.
2. **Style-aware decoder**: A style-aware decoder synthesizes stylized animation parameters from audio, conditioned on the style code. The decoder uses a Transformer decoder as its backbone and takes the style code as the query, so that cross-attention ties the audio representation closely to a specific style and improves the synthesis of stylized animations.
3. **Two-branch architecture**: The framework is extended into two branches that generate stylized facial expressions and stylized head poses, respectively. For facial expressions, the kernel weights of the feed-forward layer are generated adaptively; for head poses, a recurrent mechanism predicts head movements step by step.
4. **Image renderer**: Finally, the image renderer converts the generated 3DMM coefficients and the reference image into a realistic talking head video.

### Method overview

- **3D face reconstruction**: Use a 3DMM to represent the face shape and extract 3DMM coefficients from the portrait image.
- **Style encoder**: Extract the style codes of facial expressions and head poses from the reference video.
- **Acoustic encoder**: Process the audio into acoustic features that carry the rhythm and intonation information related to head movements.
- **Style-aware head-pose decoder**: Generate stylized head movements with Transformer-XL, taking into account long-term audio rhythm and the immediate head-pose state.
- **Image renderer**: Convert the generated 3DMM coefficients and the reference image into a video.

### Experimental results

Extensive experiments show that the method generates visually realistic talking head videos with diverse speaking styles, while achieving accurate lip synchronization, convincing facial expressions, and natural head movements.

### Formulas

- **Generation of style code**: \[ s=\text{softmax}(W_sH)H^T \] where \(W_s\in\mathbb{R}^{1\times d_s}\) is a trainable parameter, \(H = [s_1,\ldots,s_N]\in\mathbb{R}^{d_s\times N}\) is the encoded feature sequence, and \(d_s\) is the dimension of each style vector.
- **Triplet loss**: \[ L_{\text{trip}}=\max\{\|s_c - s_p^c\|_2-\|s_c - s_n^c\|_2+\gamma,0\} \] where \(\gamma\) is a margin parameter, set to 5.
- **Head-pose generation**: \[ (c_i,e_i)=\text{TransXL}(c_{i - 1},a_i'\oplus e_{i - 1}\oplus s_h) \] Finally, a fully connected layer decodes \(e_i\) into the head pose \(h_i\in\mathbb{R}^6\).
- **Head-pose reconstruction constraint**: \[ L_{\text{SSIM}} = 1-\frac{(2\mu\hat{\m
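The style-code and triplet-loss formulas listed under "Formulas" can be illustrated with a small sketch. This is not the authors' implementation: the function names `attention_pool` and `triplet_loss` are hypothetical, the matrices are plain Python lists with toy values, and the norm is assumed to be the ordinary Euclidean norm. Only the margin default of 5 comes from the summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(w_s, H):
    """Compute s = softmax(W_s H) H^T (hypothetical helper).

    w_s: list of d_s floats, the 1 x d_s row vector W_s.
    H:   list of d_s rows with N entries each, the d_s x N feature matrix.
    Returns the pooled style code as a list of d_s floats.
    """
    d_s, n = len(H), len(H[0])
    # W_s H gives one attention score per time step (a 1 x N row vector).
    scores = [sum(w_s[k] * H[k][j] for k in range(d_s)) for j in range(n)]
    weights = softmax(scores)
    # (1 x N) times H^T (N x d_s): a weighted average of the columns of H.
    return [sum(weights[j] * H[k][j] for j in range(n)) for k in range(d_s)]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """max(||s_c - s_p|| - ||s_c - s_n|| + margin, 0), margin 5 as in the summary."""
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(gap + margin, 0.0)


# Toy example: d_s = 2 feature dimensions, N = 3 time steps.
H = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
style_code = attention_pool([0.0, 0.0], H)  # zero scores give uniform weights
loss = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

With an all-zero `w_s` every time step receives the same weight, so the pooled style code reduces to the per-dimension mean of `H`; a trained `W_s` would instead emphasize the time steps that are most informative about style. The triplet loss reaches zero only once the negative sample is at least `margin` farther from the anchor than the positive sample.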

StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis

Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

TalkingStyle: Personalized Speech-Driven 3D Facial Animation with Style Preservation

Style Transfer for 2D Talking Head Animation

Audio-driven Talking Face Video Generation with Natural Head Pose

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Say Anything with Any Style

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

MakeItTalk: Speaker-Aware Talking-Head Animation

SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

3D Talking Face with Personalized Pose Dynamics

StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN