ManiTalk: manipulable talking head generation from single image in the wild

Fang, Hui; Tian, Zeyu; Ma, Yin
DOI: https://doi.org/10.1007/s00371-024-03490-4
IF: 2.835
2024-06-09
The Visual Computer
Abstract: Generating talking head videos through a face image and a piece of speech audio has gained widespread interest. Existing talking face synthesis methods typically lack the ability to generate manipulable facial details and pupils, which is desirable for producing stylized facial expressions. We present ManiTalk, the first manipulable audio-driven talking head generation system. Our system consists of three stages. In the first stage, the proposed Exp Generator and Pose Generator generate synchronized talking landmarks and presentation-style head poses. In the second stage, we parameterize the positions of eyebrows, eyelids, and pupils, enabling personalized and straightforward manipulation of facial details. In the last stage, we introduce SFWNet to warp facial images based on the landmark motions. Additional driving sketches are input to generate more precise expressions. Extensive quantitative and qualitative evaluations, along with user studies, demonstrate that the system can accurately manipulate facial details and achieve excellent lip synchronization. Our system achieves state-of-the-art performance in terms of identity preservation and video quality. Code is available at https://github.com/shanzhajuan/ManiTalk.
computer science, software engineering