Abstract: While previous audio-driven talking head generation (THG) methods generate head poses from the driving audio, the generated poses or lips either fail to match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that generates lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is to use head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio provides the short-term variations and rhythm correspondence of the head movements, and the text prompts describe the long-term semantics of the head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model that generates motion latents from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4% of the total reconstruction loss caused by both pose and lip, so optimization leans towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy that synthesizes natural talking videos using two cascaded networks, CoarseNet and RefineNet. CoarseNet estimates coarse motions to produce animated images in novel poses, while RefineNet focuses on learning finer lip motions by progressively estimating them from low to high resolution, yielding improved lip-synchronization performance. Experiments demonstrate that our pose prediction strategy achieves better pose diversity and realness than text-only or audio-only conditioning, and that our video generator outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk
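
The loss-imbalance observation in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' code: the function name `lip_loss_share`, the rectangular lip box, and the synthetic 10x10 frame are all assumptions made for illustration. It only shows how a small lip region can account for a small share of a pixel-wise L1 reconstruction loss when the whole head moves.

```python
def lip_loss_share(pred, target, lip_box):
    """Return the fraction of the total L1 error that falls inside lip_box.

    pred, target: 2D lists of floats (grayscale frames of equal size).
    lip_box: (top, left, bottom, right), a hypothetical lip bounding box
             with end-exclusive bottom/right bounds.
    """
    top, left, bottom, right = lip_box
    total = 0.0
    lip = 0.0
    for r, (p_row, t_row) in enumerate(zip(pred, target)):
        for c, (p, t) in enumerate(zip(p_row, t_row)):
            err = abs(p - t)
            total += err
            if top <= r < bottom and left <= c < right:
                lip += err
    # Avoid division by zero when the two frames are identical.
    return lip / total if total > 0 else 0.0


# Toy 10x10 frame: every pixel is off by 1.0 (as if the whole head moved),
# and the assumed lip box covers 1 row x 3 columns = 3 of the 100 pixels.
target = [[0.0] * 10 for _ in range(10)]
pred = [[1.0] * 10 for _ in range(10)]
share = lip_loss_share(pred, target, lip_box=(7, 4, 8, 7))
print(f"lip share of L1 loss: {share:.2%}")
```

In this toy setting the lip region's share simply equals its share of the pixels. With such a small weight in the total loss, gradients are dominated by the rest of the frame, which is the motivation the abstract gives for a separate refinement stage focused on lip motion.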

Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance

Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control

Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control

Supervised Video-To-Video Synthesis For Single Human Pose Transfer

Text2Performer: Text-Driven Human Video Generation

Pose Guided Human Video Generation

Make It Move: Controllable Image-to-Video Generation with Text Descriptions

Zero-shot High-fidelity and Pose-controllable Character Animation

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

Human Motion Transfer from Poses in the Wild

Text-Animator: Controllable Visual Text Video Generation

ControlVideo: Training-free Controllable Text-to-Video Generation

PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling

MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation

T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data