Abstract: Generating stylized talking heads with diverse head motions is crucial for achieving natural-looking videos, but it remains challenging. Previous works either adopt a regressive method to capture the speaking style, which yields a coarse style averaged across all training data, or employ a universal network to synthesize videos with different styles, which leads to suboptimal performance. To address these issues, we propose a novel dynamic-weight method, namely Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, along with the generative model, enhances precision and robustness when extracting the speaking styles of the given style clips. Using the extracted style, a residual architecture comprising a canonical branch and a style-specific branch predicts the mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we avoid a universal network and instead design an elaborate HyperStyle that produces style-specific weight offsets for the style branch. Furthermore, we construct a pose generator and a pose codebook that stores the quantized pose representation, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-the-art methods in terms of both lip synchronization and stylized expression. In addition, we extend SAAS to the video-driven style editing field and achieve satisfactory performance.
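The two mechanisms named in the abstract, querying a discrete codebook and adding style-specific weight offsets to a residual branch, can be illustrated with a minimal sketch. This is not the SAAS implementation. The function names, the nearest-neighbour lookup (the standard VQ-VAE quantization step), the linear mapping from style code to offsets, and the scalar output are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Replace each feature by its nearest codebook entry.

    This is the generic vector-quantization lookup used by VQ-VAE models;
    how SAAS trains or queries its style codebook is not shown here.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]
    quantized = [list(codebook[k]) for k in indices]
    return indices, quantized


def style_specific_weights(
    base: Vector, style_code: Vector, hyper: Sequence[Vector]
) -> List[float]:
    """Hypothetical hypernetwork: a linear map from style code to offsets.

    Each row of `hyper` produces the offset for one weight of the style
    branch; the offset is added to the shared base weight.
    """
    offsets = [dot(row, style_code) for row in hyper]
    return [w + d for w, d in zip(base, offsets)]


def predict(audio: Vector, canonical_w: Vector, style_w: Vector) -> float:
    """Residual prediction: canonical branch output plus style branch output."""
    return dot(audio, canonical_w) + dot(audio, style_w)
```

In this sketch the style code would come from the quantized output of `quantize`, and `style_specific_weights` would be re-evaluated for each new style clip, so that the style branch changes with the style while the canonical branch stays fixed.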

The Importance Weighted Autoencoder in End-to-End Speech Synthesis

Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis

Extracting and Predicting Word-Level Style Variations for Speech Synthesis

End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training

Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

Innovative Speaker-Adaptive Style Transfer VAE-WadaIN for Enhanced Voice Conversion in Intelligent Speech Processing

StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Using Multiple Reference Audios and Style Embedding Constraints for Speech Synthesis

Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder

Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis

Towards Multi-Scale Style Control for Expressive Speech Synthesis

Bi-Level Style and Prosody Decoupling Modeling for Personalized End-to-End Speech Synthesis

Feature Based Adaptation for Speaking Style Synthesis

Fine-grained style control in Transformer-based Text-to-speech Synthesis

Say Anything with Any Style

Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis

Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis