Abstract: The study of talking face generation mainly explores the intricacies of synchronizing facial movements and crafting visually appealing, temporally coherent animations. However, because global audio perception has been little explored, current approaches predominantly rely on auxiliary visual and spatial knowledge to stabilize the movements, which often degrades naturalness and introduces temporal inconsistencies. Considering the essence of audio-driven animation, the audio signal serves as the ideal and unique prior for adjusting facial expressions and lip movements, without resorting to the interference of any visual signals. Based on this motivation, we propose a novel paradigm, dubbed Sonic, to {s}hift f{o}cus to the exploration of global audio per{c}ept{i}o{n}. To effectively leverage global audio knowledge, we disentangle it into intra- and inter-clip audio perception and combine both aspects to enhance overall perception. For intra-clip audio perception, we propose: 1) \textbf{Context-enhanced audio learning}, in which long-range intra-clip temporal audio knowledge is extracted to provide facial expression and lip motion priors, implicitly expressed as the tone and speed of speech; and 2) \textbf{Motion-decoupled controller}, in which head motion and expression motion are disentangled and independently controlled by intra-clip audio. Most importantly, for inter-clip audio perception, as a bridge that connects the clips to achieve global perception, we propose \textbf{Time-aware position shift fusion}, in which global inter-clip audio information is considered and fused for long-audio inference through consecutive time-aware shifted windows.
Extensive experiments demonstrate that the novel audio-driven paradigm outperforms existing SOTA methodologies in terms of video quality, temporal consistency, lip synchronization precision, and motion diversity.
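As a rough illustration of the shifted-window idea behind time-aware position shift fusion, the sketch below partitions a long frame sequence into fixed-size clips whose boundaries move by a small offset at every step, so that frames separated by a clip boundary at one step share a clip at a later step. This is only a minimal sketch under stated assumptions: the function name `shifted_windows`, the parameters `window`, `offset`, and `step`, and the circular wrap-around are hypothetical choices for illustration and are not taken from the abstract, which does not specify the actual procedure.

```python
def shifted_windows(num_frames, window, offset, step):
    """Partition frame indices into clips whose start shifts with `step`.

    Hypothetical helper: names and the wrap-around rule are assumptions.
    """
    if num_frames <= 0 or window <= 0:
        raise ValueError("num_frames and window must be positive")
    # Assumed rule: the clip boundary moves by `offset` frames per step.
    start = (step * offset) % window
    # Circular indexing so every frame is covered exactly once per step.
    order = [(start + i) % num_frames for i in range(num_frames)]
    # Cut the shifted ordering into consecutive clips of length `window`.
    return [order[i:i + window] for i in range(0, num_frames, window)]


if __name__ == "__main__":
    for step in range(3):
        print(step, shifted_windows(num_frames=8, window=4, offset=1, step=step))
```

In this toy setting, frames 3 and 4 fall in different clips at step 0 but in the same clip at step 1, which is the sense in which information can propagate across clip boundaries over successive steps.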

A Comprehensive Review and Taxonomy of Audio-Visual Synchronization Techniques for Realistic Speech Animation

Audio-Synchronized Visual Animation

Text-driven Visual Prosody Generation for Embodied Conversational Agents

Real-time speech-driven lip synchronization

Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis

Sonic: Shifting Focus to Global Audio Perception in Portrait Animation

SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis

Synchronising audio and ultrasound by learning cross-modal embeddings

A Review of Text-to-Visual Speech Synthesis

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Content and Style Aware Audio-Driven Facial Animation

Towards Streaming Speech-to-Avatar Synthesis

Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert

Audio-driven Talking Face Generation with Stabilized Synchronization Loss

Learning Audio-Driven Viseme Dynamics for 3D Face Animation

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores

Synchronizing Audio-Visual Film Stimuli in Unity (version 5.5.1f1): Game Engines as a Tool for Research

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation