Implicit Memory-Based Variational Motion Talking Face Generation

Daowu Yang, Sheng Huang, Wen Jiang, Jin Zou
DOI: https://doi.org/10.1109/lsp.2024.3356415
2024-02-02
IEEE Signal Processing Letters
Abstract: Speech-driven facial animation is a challenging problem where each input audio can have multiple plausible facial outputs, leading to overly smooth results. Although the two-stage framework of audio-to-motion model and neural rendering models can partially mitigate this issue, it lacks crucial details like emotions and wrinkles. To overcome these limitations, we introduce a variational motion generator with implicit memory. By incorporating implicit memory into the audio-to-motion model, we capture high-level semantics in the shared latent space of audio expressions, resulting in accurate and expressive facial landmark generation. Next, we introduce attention with time bias to effectively maintain the consistency of audio motion and adopt a periodic position encoding strategy to provide summarization capability for longer audio sequences. Experimental results demonstrate that our approach outperforms previous methods, yielding more extensive and realistic speech-driven facial animation.
engineering, electrical & electronic