Audio-Driven Stylized Gesture Generation with Flow-Based Model

Sheng Ye, Yu-Hui Wen, Yanan Sun, Ying He, Ziyang Zhang, Yaoyuan Wang, Weihua He, Yong-Jin Liu
DOI: https://doi.org/10.1007/978-3-031-20065-6_41
2022-01-01
Abstract: Generating stylized audio-driven gestures for robots and virtual avatars has attracted increasing attention recently. Existing methods require either style labels (e.g., speaker identities) or complex data preprocessing to obtain style control parameters. In this paper, we propose a new end-to-end flow-based model that can generate audio-driven gestures of arbitrary styles without preprocessing or style labels. To achieve this goal, we introduce a global encoder and a gesture perceptual loss into the classic generative flow model to capture both global and local information. We conduct extensive experiments on two benchmark datasets: the TED Dataset and the Trinity Dataset. Both quantitative and qualitative evaluations show that the proposed model outperforms state-of-the-art models.