Cospeech body motion generation using a transformer

Zixiang Lu, Zhitong He, Jiale Hong, Ping Gao
DOI: https://doi.org/10.1007/s10489-024-05769-4
IF: 5.3
2024-09-20
Applied Intelligence
Abstract: Body language is a method for communicating across languages and cultures. Making good use of body motions in speech can enhance persuasiveness, improve personal charisma, and make speech more effective. Generating matching body motions for digital avatars and social robots based on content has become an important topic. In this paper, we propose a transformer-based network model to generate body motions from input speech. Our model includes an audio transformer encoder, motion transformer encoder, template variational autoencoder, cross-modal transformer encoder, and motion decoder. Additionally, we propose a novel evaluation metric for describing motion change trends in terms of distance. The experimental results show that the proposed model provides higher-quality motion generation results than state-of-the-art models. As indicated by visual skeleton motions, our results are more natural and realistic than those of other methods. Additionally, the generated motions yield superior results in terms of multiple evaluation metrics.
computer science, artificial intelligence