Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation

Zhihua Xu, Tianshui Chen, Zhijing Yang, Chunmei Qing, Yukai Shi, Liang Lin
DOI: https://doi.org/10.1145/3664647.3681017
2024-01-01
Abstract: Speech-preserving Facial Expression Manipulation (SPFEM) aims to alter facial emotions in video content while preserving the facial movements associated with speech. Existing methods often fall short because of inadequate emotion representation and the absence of time-aligned paired data, that is, two corresponding frames from the same speaker that show the same speech content but differ in emotional expression. In this work, we introduce a novel framework, Self-Supervised Emotion Representation Disentanglement (SSERD), which disentangles the emotion representation for accurate emotion transfer and includes a paired data construction module to facilitate automated, photorealistic facial animation. Specifically, we develop a module that learns emotion latent codes in StyleGAN's latent space, employing a cross-attention mechanism to extract and predict emotion editing codes and contrastive learning to differentiate emotions. To overcome the lack of strictly paired data in the SPFEM task, we exploit a pretrained StyleGAN to generate paired data, focusing on expression vectors unrelated to mouth shape. Additionally, we employ a hybrid training strategy that uses both synthetic paired data and real unpaired data to enhance the realism of the images generated by the SPFEM model. Extensive experiments on benchmark datasets, including MEAD and RAVDESS, validate the effectiveness of our framework and demonstrate its superior capability in generating photorealistic and expressive facial animations.
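The abstract mentions contrastive learning to differentiate emotions but does not give the loss. As a rough illustration only, the sketch below implements a generic supervised contrastive loss over a batch of emotion codes: codes with the same emotion label are pulled together and all others are pushed apart. The function name, array shapes, and temperature value are assumptions made for this example and are not taken from the SSERD paper.

```python
import numpy as np


def supervised_contrastive_loss(codes, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not the SSERD loss).

    codes:  (N, D) array of emotion latent codes (hypothetical shape),
            assumed to have non-zero norm.
    labels: (N,) integer emotion labels; at least one label must occur twice.
    """
    codes = np.asarray(codes, dtype=np.float64)
    labels = np.asarray(labels)

    # L2-normalise so that dot products are cosine similarities.
    z = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Positives: other samples in the batch with the same emotion label.
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over all other samples; the anchor itself is excluded
    # from the denominator.
    sim = sim - sim.max(axis=1, keepdims=True)  # shift for numerical stability
    exp_sim = np.exp(sim) * not_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))

    # Average the negative log-probability over each anchor's positives,
    # skipping anchors that have no positive in the batch.
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    per_anchor = -(log_prob * positives).sum(axis=1)[has_pos] / pos_count[has_pos]
    return float(per_anchor.mean())
```

With this formulation, a batch whose same-label codes point in similar directions yields a loss near zero, while a batch whose labels cut across the clusters yields a much larger value.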