Abstract: Generating stylized talking heads with diverse head motions is crucial for natural-looking videos but remains challenging. Previous works either adopt a regressive method to capture the speaking style, which yields a coarse style averaged across all training data, or employ a universal network to synthesize videos with different styles, which leads to suboptimal performance. To address these issues, we propose a novel dynamic-weight method, namely Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, along with the generative model, enhances precision and robustness when extracting the speaking styles of the given style clips. Using the extracted style, a residual architecture comprising a canonical branch and a style-specific branch predicts the mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we avoid a universal network and instead design an elaborate HyperStyle that produces style-specific weight offsets for the style branch. Furthermore, we construct a pose generator and a pose codebook to store the quantized pose representation, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-the-art methods in both lip synchronization and stylized expression. In addition, we extend SAAS to video-driven style editing and achieve satisfactory performance.
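
The abstract relies on a learned codebook that stores discrete (quantized) style and pose representations. As a rough illustration of what querying such a codebook can look like, the sketch below performs a nearest-neighbor lookup, the standard quantization step in a VQ-VAE. It is not the SAAS implementation: the function name `quantize`, the two-dimensional features, and the codebook values are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def quantize(
    feature: Sequence[float],
    codebook: List[Sequence[float]],
) -> Tuple[int, Sequence[float]]:
    """Return the index and entry of the codebook vector closest to `feature`.

    Closeness is measured by squared Euclidean distance, as in a
    standard VQ-VAE quantization step.
    """
    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(feature, codebook[i]),
    )
    return index, codebook[index]


# Hypothetical 2-D codebook with three entries (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]

# A continuous feature, e.g. an encoder output, snapped to its nearest code.
index, code = quantize((0.9, 1.2), codebook)
```

In a trained model the codebook entries would be learned jointly with the encoder and decoder; here they are fixed by hand only to show the lookup.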

Style-A-Video: Agile Diffusion for Arbitrary Text-Based Video Style Transfer

Correlation-based and Content-Enhanced Network for Video Style Transfer

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion

TeSTNeRF: Text-Driven 3D Style Transfer Via Cross-Modal Learning

Real-time Arbitrary Video Style Transfer

StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models

Stable Video Style Transfer Based on Partial Convolution with Depth-Aware Supervision

Consistent Video Style Transfer Via Compound Regularization

StableVideo: Text-driven Consistency-aware Diffusion Video Editing

Cvstgan: A Controllable Generative Adversarial Network for Video Style Transfer of Chinese Painting

Structure-Guided Arbitrary Style Transfer for Artistic Image and Video

DiffStyler: Diffusion-based Localized Image Style Transfer

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

Semantic-Aware Video Style Transfer Based on Temporal Consistent Sparse Patch Constraint

Real-time Localized Photorealistic Video Style Transfer

Say Anything with Any Style