Sonic: Shifting Focus to Global Audio Perception in Portrait Animation

Xiaozhong Ji, Xiaobin Hu, Zhihong Xu, Junwei Zhu, Chuming Lin, Qingdong He, Jiangning Zhang, Donghao Luo, Yi Chen, Qin Lin, Qinglin Lu, Chengjie Wang
2024-11-25
Abstract: The study of talking face generation mainly explores the intricacies of synchronizing facial movements and crafting visually appealing, temporally coherent animations. However, due to the limited exploration of global audio perception, current approaches predominantly employ auxiliary visual and spatial knowledge to stabilize the movements, which often degrades naturalness and introduces temporal inconsistencies. Considering the essence of audio-driven animation, the audio signal serves as the ideal and unique prior for adjusting facial expressions and lip movements, without resorting to the interference of any visual signals. Based on this motivation, we propose a novel paradigm, dubbed Sonic, to {s}hift f{o}cus on the exploration of global audio per{c}ept{i}o{n}. To effectively leverage global audio knowledge, we disentangle it into intra- and inter-clip audio perception and combine both aspects to enhance overall perception. For intra-clip audio perception, we introduce (1) **Context-enhanced audio learning**, in which long-range intra-clip temporal audio knowledge is extracted to provide facial expression and lip motion priors implicitly expressed as the tone and speed of speech; and (2) **Motion-decoupled controller**, in which head motion and expression movement are disentangled and independently controlled by intra-audio clips. Most importantly, for inter-clip audio perception, as a bridge that connects the intra-clips to achieve global perception, we introduce **Time-aware position shift fusion**, in which the global inter-clip audio information is considered and fused for long-audio inference through consecutive time-aware shifted windows.
Extensive experiments demonstrate that the novel audio-driven paradigm outperforms existing SOTA methodologies in terms of video quality, temporal consistency, lip synchronization precision, and motion diversity.
Multimedia, Graphics, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses two key issues in current audio-driven facial animation generation:

1. **Accurate audio control and temporal coherence have not yet been achieved together**: Existing methods usually handle audio control and temporal coherence completely separately, ignoring the overall coordination between audio and vision. They often rely on timestamp-segmented audio features that are matched to each visual frame, which limits how well the motion representation can be optimized and results in reduced naturalness and temporal inconsistency.
2. **Lack of exploration of global audio perception**: Current methods mainly rely on auxiliary visual and spatial knowledge to stabilize motion, which often leads to reduced naturalness and temporal inconsistency. Yet the audio signal, which is the ideal and sole prior for adjusting facial expressions and lip movements, has not received sufficient attention.

To solve these problems, the paper proposes a new paradigm, Sonic, which focuses on exploring global audio perception rather than relying on motion frames and other visual motion cues. Sonic pursues this goal through three main components:

- **Context-enhanced audio learning**: Extracts audio-temporal knowledge within the input audio segment through a long-range audio learning module and maps it into temporal embeddings for subsequent audio-temporal cross-attention fusion.
- **Motion-decoupled controller**: Decouples head motion from expression motion and supports independent control of each through two explicit parameters learned from the current audio segment.
- **Time-aware position shift fusion**: Continuously bridges to the previous segment through a time-aware sliding window, expanding local audio perception to global cross-segment audio perception and thereby strengthening temporal modeling.

These innovations aim to improve the quality, temporal coherence, lip-sync precision, and motion diversity of the generated videos. Extensive experiments show that Sonic outperforms existing state-of-the-art methods in these respects.
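
To make the shifted-window idea more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a long sequence of `num_frames` frames is split into fixed-length clips, that the clip boundaries move by a constant `shift` at each inference step, and that indices wrap around circularly at the end of the sequence. The function name `shifted_windows` and all of its parameters are hypothetical and are not taken from the paper.

```python
def shifted_windows(num_frames, clip_len, shift, step):
    """Split frame indices into clips whose start is offset by step * shift.

    Hypothetical illustration: the offset, the fixed clip length, and the
    circular wrap-around are assumptions, not details from the paper.
    """
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")

    # Offset of the first clip for this step, wrapped into the valid range.
    start = (step * shift) % num_frames

    windows = []
    for begin in range(0, num_frames, clip_len):
        # The last clip may be shorter when clip_len does not divide num_frames.
        length = min(clip_len, num_frames - begin)
        windows.append([(start + begin + i) % num_frames for i in range(length)])
    return windows


if __name__ == "__main__":
    for step in range(3):
        print(step, shifted_windows(num_frames=10, clip_len=4, shift=3, step=step))
```

Under these assumptions, every frame is covered exactly once per step, while the clip boundaries fall at different positions from one step to the next, so frames that sit on a boundary in one step lie inside a clip in another. This is one simple way to let information flow across clip boundaries over successive steps.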