Abstract: Generating stylized talking heads with diverse head motions is crucial for achieving natural-looking videos but remains challenging. Previous works either adopt a regressive method to capture the speaking style, resulting in a coarse style that is averaged across all training data, or employ a universal network to synthesize videos with different styles, which causes suboptimal performance. To address these issues, we propose a novel dynamic-weight method, namely Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, along with the generative model, enhances precision and robustness when extracting the speaking styles of the given style clips. Using the extracted style, a residual architecture comprising a canonical branch and a style-specific branch predicts the mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we avoid a universal network and instead explore an elaborate HyperStyle that produces style-specific weight offsets for the style branch. Furthermore, we construct a pose generator and a pose codebook that stores the quantized pose representation, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-the-art methods in terms of both lip synchronization and stylized expression. In addition, we extend SAAS to the video-driven style editing field and achieve satisfactory performance.
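The two mechanisms named in the abstract, a codebook lookup and a style branch whose weights are shifted by a style-dependent offset, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`quantize`, `weight_offset`, `residual_predict`), the single linear layers, and the linear map standing in for HyperStyle are all assumptions made for illustration only.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z (N, D) to its nearest codebook entry (K, D).

    Nearest-neighbour lookup as used in VQ-VAE-style models (assumed here).
    """
    # Squared Euclidean distance between every query and every code: (N, K).
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    return codebook[idx], idx

def weight_offset(style_code, hyper_w, shape):
    """Hypothetical stand-in for HyperStyle: a linear map from a style
    code to a weight offset with the shape of the style-branch weights."""
    return (style_code @ hyper_w).reshape(shape)

def residual_predict(x, w_canon, w_style, delta_w):
    """Canonical branch output plus the style-specific residual, where the
    style branch uses weights shifted by the style-dependent offset."""
    return x @ w_canon + x @ (w_style + delta_w)

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
style_feat = np.array([[0.9, 1.2]])            # continuous style feature
style_code, idx = quantize(style_feat, codebook)

d_in, d_out = 3, 2
hyper_w = rng.normal(size=(2, d_in * d_out))   # assumed hypernetwork weights
delta_w = weight_offset(style_code[0], hyper_w, (d_in, d_out))

x = rng.normal(size=(5, d_in))                 # e.g. per-frame audio features
w_canon = rng.normal(size=(d_in, d_out))
w_style = rng.normal(size=(d_in, d_out))
y = residual_predict(x, w_canon, w_style, delta_w)
```

In this sketch a different style clip yields a different `style_code`, hence a different `delta_w`, so the style branch changes per style while the canonical branch stays fixed.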

Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis

Feature Based Adaptation for Speaking Style Synthesis

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Using Multiple Reference Audios and Style Embedding Constraints for Speech Synthesis

Style Mixture of Experts for Expressive Text-To-Speech Synthesis

StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin

Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

Exploring synthetic data for cross-speaker style transfer in style representation based TTS

Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis

Adaptive Text to Speech for Spontaneous Style

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Towards Multi-Scale Style Control for Expressive Speech Synthesis

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

Unsupervised Multi-scale Expressive Speaking Style Modeling with Hierarchical Context Information for Audiobook Speech Synthesis

Say Anything with Any Style

Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis