Abstract: While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips either fail to match the audio well or are not editable. In this study, we propose \textbf{PoseTalk}, a THG system that generates lip-synchronized talking head videos with freely controllable head poses conditioned on text prompts and audio. The core insight of our method is to use head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio provides the short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model that generates motion latents from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4\% of the total reconstruction loss caused by both pose and lip, which biases optimization towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy that synthesizes natural talking videos using two cascaded networks, i.e., CoarseNet and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses, and the RefineNet focuses on learning finer lip motions by progressively estimating them from low to high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate that our pose prediction strategy achieves better pose diversity and realism than text-only or audio-only conditioning, and that our video generator outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk

Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

Audio-driven Talking Face Video Generation with Natural Head Pose

Spatially and Temporally Optimized Audio-Driven Talking Face Generation

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning

Multimodal Learning for Temporally Coherent Talking Face Generation with Articulator Synergy

HyperLips: Hyper Control Lips with High Resolution Decoder for Talking Face Generation

Multimodal Inputs Driven Talking Face Generation With Spatial–Temporal Dependency

That's What I Said: Fully-Controllable Talking Face Generation

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

One-Shot Talking Face Generation from Single-Speaker Audio-Visual Correlation Learning

Audio-driven Talking Face Generation with Stabilized Synchronization Loss

OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

Meta Talk: Learning To Data-Efficiently Generate Audio-Driven Lip-Synchronized Talking Face With High Definition

RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

Real-time speech-driven lip synchronization