Dual-Path Transformer-Based GAN for Co-speech Gesture Synthesis

Xinyuan Qian, Hao Tang, Jichen Yang, Hongxu Zhu, Xu-Cheng Yin
DOI: https://doi.org/10.1007/s12369-024-01136-y
IF: 3.802
2024-05-15
International Journal of Social Robotics
Abstract: Co-speech gestures have a significant impact on conveying information. For social agents, producing realistic and smooth gestures is crucial to enabling natural interactions with humans, but it is a challenging task that depends on many factors (e.g., speech audio, content, and the interacting person). In this paper, we tackle the cross-modal fusion problem through a novel fusion mechanism for end-to-end learning-based co-speech gesture generation. In particular, we employ parallel directional cross-modal transformers and an interactive, cascaded 2D attention module to achieve selective fusion of the gesture-related cues. In addition, we propose new metrics to evaluate gesture diversity and speech-gesture correspondence without requiring 3D pose annotations. Experiments on a public dataset indicate that the proposed method successfully produces diverse, human-like poses and outperforms other competitive state-of-the-art methods in both objective and subjective evaluations.
robotics
What problem does this paper attempt to address?