Transformer Based Multi-model Fusion for 3D Facial Animation

Benwang Chen, Chunshui Luo, Haoqian Wang
DOI: https://doi.org/10.1109/cfasta57821.2023.10243300
2023-01-01
Abstract: Drivable 3D facial animation has been studied extensively for many years. Because audiovisual data are limited and the problem is inherently uncertain, generating faithful facial motion remains challenging. Most existing research relies mainly on audio information, which tends to produce excessively smooth lip movements. Considering the progress of multimodal learning and the advantages of phoneme-based mapping approaches, we incorporate phoneme and text representations into the task to strengthen the relationship between speech and animation. Specifically, we present a novel multi-model fusion Transformer for 3D facial generation that learns features from the audio, text, and phoneme modalities. Exploiting the intrinsic relationship between phonemes and their corresponding mouth movements constrains the excessive smoothness of the mapping from audio to mouth movements. Using powerful language representation models to obtain contextual semantic information from the text can eliminate facial animation biases caused by individual speaking characteristics in the audio. Additionally, to exploit the rich information shared between modalities, we introduce an attention-based fusion mechanism that integrates related features. Qualitative and quantitative experiments demonstrate the effectiveness of our method.