Abstract: In recent years, the field of talking head generation has made significant strides. However, the substantial computational resources needed for model training, coupled with a scarcity of high-quality video data, make it difficult to rapidly customize a model to a specific individual. Additionally, existing models usually support only single-modal control and cannot generate vivid facial expressions and controllable head poses from multiple conditions such as audio and video. These limitations restrict the models' widespread application. In this paper, we introduce a two-stage method called Control-Talker that achieves rapid identity customization of a talking head model and high-quality generation based on multimodal conditions. Specifically, we divide the training process into two stages: a prior learning stage and a rapid identity-customization stage. 1) In the prior learning stage, we leverage a diffusion-based model pre-trained on a high-quality image dataset to acquire a robust, controllable facial prior. We also propose a high-frequency ControlNet structure to enhance the fidelity of the synthesized results. This structure extracts a high-frequency feature map from the source image, which serves as a facial texture prior and thereby preserves the facial texture of the source image. 2) In the rapid identity-customization stage, the identity is fixed by fine-tuning the U-Net part of the diffusion model on only a few images of a specific individual. The entire fine-tuning process for identity customization can be completed within approximately ten minutes, significantly reducing training costs. Further, we propose a unified driving method for both audio and video, enabling the model to precisely control expressions, poses, and lighting under multiple conditions. Extensive experiments and visual results demonstrate that our method outperforms other state-of-the-art models. Additionally, our model has lower training costs and lower data requirements.
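
To make the idea of a high-frequency feature map concrete, the sketch below shows one common way to isolate high-frequency content from an image: subtract a low-pass (blurred) copy from the original. This is an illustrative assumption only. The abstract does not state how Control-Talker computes its high-frequency map, and the function names `box_blur` and `high_frequency_map`, the box filter, and the `radius` parameter are all hypothetical.

```python
def box_blur(image, radius=1):
    """Low-pass filter: mean over a (2*radius+1)^2 window, with edges clamped.

    `image` is a 2D list of grayscale intensities (hypothetical input format).
    """
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Clamp coordinates so border pixels reuse the nearest valid pixel.
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += image[yy][xx]
                    count += 1
            blurred[y][x] = total / count
    return blurred


def high_frequency_map(image, radius=1):
    """High-pass residual: original minus its blurred copy.

    Assumption: a generic stand-in for a texture prior, not the paper's method.
    Smooth regions map to values near zero; edges and fine texture stand out.
    """
    low = box_blur(image, radius)
    return [
        [image[y][x] - low[y][x] for x in range(len(image[0]))]
        for y in range(len(image))
    ]
```

Under this assumption, a uniformly lit region produces a residual of zero, while fine detail such as skin texture or hair edges yields large positive or negative values, which is the kind of signal a texture prior would need to carry.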

Performance Comparison of ControlNet Models Based on PONY in Complex Human Pose Image Generation

Context-Guided Adaptive Network for Efficient Human Pose Estimation

Real-Time Audio-Guided Multi-Face Reenactment

Gated Neural Network Framework for Interactive Character Control

Skip-and-Play: Depth-Driven Pose-Preserved Image Generation for Any Objects

FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection

Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

OmniControlNet: Dual-stage Integration for Conditional Image Generation

From Text to Pose to Image: Improving Diffusion Model Control and Quality

ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

Multi Positive Contrastive Learning with Pose-Consistent Generated Images

HPnet: Hybrid Parallel Network for Human Pose Estimation

Control-Talker: A Rapid-Customization Talking Head Generation Method for Multi-Condition Control and High-Texture Enhancement

ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

CONet: Crowd and occlusion-aware network for occluded human pose estimation

Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

GRPose: Learning Graph Relations for Human Image Generation with Pose Priors