StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

Suzhen Wang, Yifeng Ma, Yu Ding, Zhipeng Hu, Changjie Fan, Tangjie Lv, Zhidong Deng, Xin Yu
2024-09-14
Abstract: Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person's talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads" addresses the problem that existing one-shot talking head methods cannot capture personalized speaking characteristics and therefore fail to produce diverse speaking styles in the final video. Specifically, the paper proposes a one-shot, style-controllable talking face generation method that extracts the speaking style from a reference speaking video and drives a single portrait to speak in that style with another piece of audio. The method does this by synthesizing the style-controllable coefficients of a 3D Morphable Model (3DMM), which cover both facial expressions and head movements.

### Main contributions

1. **Style encoder**: A general-purpose style encoder is designed to model the motion patterns of facial expressions and head poses from any reference style video. The style encoder uses a Transformer encoder to learn the spatio-temporal co-activation patterns of the input parameter sequence and embeds them into a style code through a self-attention pooling layer.
2. **Style-aware decoder**: A style-aware decoder is introduced to synthesize stylized animation parameters from audio, conditioned on the style code. The decoder uses a Transformer decoder as its backbone and uses the style code as a query, so that a cross-attention mechanism ties the audio representation closely to a specific style and thereby improves the synthesis of stylized animations.
3. **Two-branch architecture**: The framework is extended into two branches that generate stylized facial expressions and stylized head poses, respectively. For stylized facial expressions, adaptive generation of the feed-forward layer's kernel weights is proposed; for stylized head poses, a recurrent mechanism is introduced to predict head movements step by step.
4. **Image renderer**: Finally, the image renderer converts the generated 3DMM coefficients and the reference image into a realistic talking head video.

### Method overview

- **3D face reconstruction**: Use a 3DMM to represent the face shape and extract 3DMM coefficients from the portrait image.
- **Style encoder**: Extract the style codes of facial expressions and head poses from the reference video.
- **Acoustic encoder**: Process the audio into acoustic features, providing rhythm and intonation information related to head movements.
- **Style-aware head-pose decoder**: Generate stylized head movements based on Transformer-XL, taking into account both long-term audio rhythm and the immediate head-pose state.
- **Image renderer**: Convert the generated 3DMM coefficients and the reference image into a video.

### Experimental results

Extensive experiments show that the method generates talking head videos that are visually realistic and exhibit diverse speaking styles, with accurate lip synchronization, convincing facial expressions, and natural head movements.

### Formulas

- **Generation of style code**: \[ s=\text{softmax}(W_sH)H^T \] where \(W_s\in\mathbb{R}^{1\times d_s}\) is a trainable parameter, \(H = [s_1,\ldots,s_N]\in\mathbb{R}^{d_s\times N}\) is the encoded feature sequence, and \(d_s\) is the dimension of each style vector.
- **Triplet loss**: \[ L_{\text{trip}}=\max\{\|s_c - s_p^c\|_2-\|s_c - s_n^c\|_2+\gamma,0\} \] where \(\gamma\) is a margin parameter, set to 5.
- **Head-pose generation**: \[ (c_i,e_i)=\text{TransXL}(c_{i-1},a_i'\oplus e_{i-1}\oplus s_h) \] A fully connected layer then decodes \(e_i\) into the head pose \(h_i\in\mathbb{R}^6\).
- **Head-pose reconstruction constraint**: \[ L_{\text{SSIM}} = 1-\frac{(2\mu\hat{\m