CLTalk: Speech-Driven 3D Facial Animation with Contrastive Learning

Xitie Zhang, Suping Wu
DOI: https://doi.org/10.1145/3652583.3657625
2024-01-01
Abstract: Speech-driven 3D facial animation aims to generate realistic and vivid 3D facial animations from speech. However, labeled data are scarce, and existing methods tend to treat this cross-modal mapping as a regression task, which can result in inadequate learning of discriminative features from the speech. This deficiency often leads to overly smooth facial movements, particularly of the lips. To address these issues, improving the accuracy of lip generation while reducing reliance on labeled data, we propose CLTalk, a framework based on a contrastive learning strategy. The framework comprises three main parts: a temporal-domain contrastive learning strategy that facilitates learning discriminative features from different audio frames, a correlation learning method that ensures consistency between the distribution of audio features and mesh labels, and a mouth opening angle constraint that further improves the accuracy of lip generation. Extensive experiments on challenging, widely used datasets indicate the effectiveness of our method compared with the state of the art.
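The abstract does not give the form of the temporal-domain contrastive objective. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss over audio frames, which may differ from what CLTalk actually implements. It assumes two views of the same clip (for example, two augmentations), each given as a list of per-frame feature vectors. The frame at the same time index in the other view is treated as the positive, and all other frames as negatives. The names `temporal_info_nce`, `cosine`, `view_a`, `view_b`, and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def temporal_info_nce(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss over frames (an assumption, not the paper's loss).

    view_a, view_b: lists of per-frame feature vectors of equal length.
    For frame i in view_a, frame i in view_b is the positive and every
    other frame in view_b is a negative.
    """
    if not view_a or len(view_a) != len(view_b):
        raise ValueError("views must be non-empty and have the same number of frames")

    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the positive frame.
        total += log_denominator - logits[i]
    return total / len(view_a)


# Toy usage: three mutually orthogonal "frames".
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
misaligned = [frames[1], frames[2], frames[0]]

loss_aligned = temporal_info_nce(frames, frames)
loss_misaligned = temporal_info_nce(frames, misaligned)
```

In this toy case the loss is close to zero when the two views are temporally aligned and much larger when the frames are shifted. That is the behavior a frame-discriminative objective is meant to encourage: features of different audio frames are pushed apart rather than averaged together.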