Abstract: While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips either fail to match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that generates lip-synchronized talking head videos with freely controllable head poses conditioned on text prompts and audio. The core insight of our method is to use head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio provides the short-term variations and rhythm of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model that generates motion latents from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the lip region contributes less than 4% of the total reconstruction loss caused by both pose and lip, so optimization leans towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy that synthesizes natural talking videos using two cascaded networks, CoarseNet and RefineNet. CoarseNet estimates coarse motions to produce animated images in novel poses, and RefineNet focuses on learning finer lip motions by progressively estimating them from low to high resolution, yielding improved lip synchronization. Experiments demonstrate that our pose prediction strategy achieves better pose diversity and realism than text-only or audio-only prediction, and that our video generator outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project page: https://junleen.github.io/projects/posetalk
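
The loss-imbalance observation can be illustrated with a toy calculation. The sketch below is not the authors' code: the frame size, the lip box, and the uniform per-pixel error are made-up assumptions, and the helper names `region_loss_share` and `box_mask` are hypothetical. It only shows how a small lip region can account for a small share of a summed L1 reconstruction loss (here 128 of 4096 pixels, about 3.1%).

```python
# Toy illustration of loss imbalance: how much of a summed L1 reconstruction
# error falls inside a small "lip" region. All sizes and values are assumed.

def region_loss_share(abs_error, mask):
    """Return the fraction of the summed absolute error that lies inside mask."""
    total = 0.0
    inside = 0.0
    for err_row, mask_row in zip(abs_error, mask):
        for err, in_region in zip(err_row, mask_row):
            total += err
            if in_region:
                inside += err
    # Guard against a frame with zero error everywhere.
    return inside / total if total > 0 else 0.0


def box_mask(height, width, top, left, box_h, box_w):
    """Build a boolean mask that is True inside an axis-aligned box."""
    return [
        [top <= y < top + box_h and left <= x < left + box_w for x in range(width)]
        for y in range(height)
    ]


# Assumed 64x64 frame with the same absolute error at every pixel.
H, W = 64, 64
abs_error = [[1.0] * W for _ in range(H)]

# Assumed lip box of 8x16 pixels near the bottom centre of the frame.
lip_mask = box_mask(H, W, top=44, left=24, box_h=8, box_w=16)

share = region_loss_share(abs_error, lip_mask)
print(f"lip share of summed L1 error: {share:.2%}")
```

Under these assumptions the gradient signal from the lip region is a small part of the total, which is consistent with the abstract's motivation for handling lip motion in a separate refinement stage.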

The Use of Dynamic Deformable Templates for Lip Tracking in an Audio-Visual Corpus with Large Variations in Head Pose, Face Illumination and Lip Shapes

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

Style-Preserving Lip Sync via Audio-Aware Style Reference

Video-audio Driven Real-Time Facial Animation

PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

Learning Audio-Driven Viseme Dynamics for 3D Face Animation

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Dynamic High Resolution Deformable Articulated Tracking

Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Ada-Tracker: Soft Tissue Tracking via Inter-Frame and Adaptive-Template Matching

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Lip-Movement Features Extraction and Recognition Based on Chroma Analysis

A Probabilistic Dynamic Contour Model For Accurate And Robust Lip Tracking

LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking

Learning the Relative Dynamic Features for Word-Level Lipreading

Displaced Dynamic Expression Regression for Real-Time Facial Tracking and Animation