Abstract: In recent years, talking head generation has made significant strides. However, the substantial computational resources required for model training, coupled with the scarcity of high-quality video data, make it difficult to rapidly customize a model to a specific individual. Additionally, existing models usually support only single-modal control and cannot generate vivid facial expressions and controllable head poses from multiple conditions such as audio and video. These limitations restrict the models' widespread application. In this paper, we introduce a two-stage method called Control-Talker that achieves rapid identity customization of a talking head model and high-quality generation under multimodal conditions. Specifically, we divide the training process into two stages: a prior learning stage and an identity rapid-customization stage. 1) In the prior learning stage, we leverage a diffusion-based model pre-trained on a high-quality image dataset to acquire a robust, controllable facial prior. We also propose a high-frequency ControlNet structure to enhance the fidelity of the synthesized results. This structure extracts a high-frequency feature map from the source image and uses it as a facial texture prior, thereby preserving the facial texture of the source image. 2) In the identity rapid-customization stage, the identity is fixed by fine-tuning the U-Net part of the diffusion model on only a few images of a specific individual. The entire fine-tuning process for identity customization can be completed in approximately ten minutes, significantly reducing training costs. Further, we propose a unified driving method for both audio and video, enabling the model to precisely control expressions, poses, and lighting under multiple conditions. Extensive experiments and visual results demonstrate that our method outperforms other state-of-the-art models while requiring lower training costs and less data.
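
The abstract does not state how the high-frequency feature map is extracted from the source image. As a rough illustration only, one common way to isolate fine texture is to subtract a Gaussian-blurred copy of an image from the image itself. The sketch below assumes that approach on a single-channel image; the function names and the `sigma` value are hypothetical and are not taken from the paper.

```python
import numpy as np


def gaussian_kernel1d(sigma: float, radius: int) -> np.ndarray:
    """Return a normalized 1D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_blur(img: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Separable Gaussian blur of a 2D array with reflect padding."""
    radius = int(3 * sigma)
    kernel = gaussian_kernel1d(sigma, radius)
    padded = np.pad(img.astype(np.float64), radius, mode="reflect")
    # Blur along rows, then along columns; "valid" trims the padding again.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded
    )
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows
    )


def high_frequency_map(img: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Assumed high-pass step: the image minus its low-frequency (blurred) part."""
    return img.astype(np.float64) - gaussian_blur(img, sigma)


# Illustrative inputs: a flat image carries no texture, a checkerboard does.
flat = np.ones((16, 16))
checker = (np.indices((16, 16)).sum(axis=0) % 2).astype(np.float64)
flat_hf = high_frequency_map(flat)
checker_hf = high_frequency_map(checker)
```

In this sketch a flat image produces an all-zero map, while fine patterns such as the checkerboard survive the subtraction, which is the kind of texture signal a conditioning branch could consume.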

Control-Talker: A Rapid-Customization Talking Head Generation Method for Multi-Condition Control and High-Texture Enhancement

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Audio-driven Talking Face Video Generation with Natural Head Pose

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

High-Fidelity and Freely Controllable Talking Head Video Generation

Meta Talk: Learning To Data-Efficiently Generate Audio-Driven Lip-Synchronized Talking Face With High Definition

TellMeTalk: Multimodal-driven talking face video generation

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

OPT: One-shot Pose-Controllable Talking Head Generation

Multi-Modal Driven Pose-Controllable Talking Head Generation

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions

Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network