The DiffuseStyleGesture+ entry to the GENEA Challenge 2023

Sicheng Yang, Haiwei Xue, Zhensong Zhang, Minglei Li, Zhiyong Wu, Xiaofei Wu, Songcen Xu, Zonghong Dai
DOI: https://doi.org/10.1145/3577190.3616114
2023-08-26
Abstract: In this paper, we introduce the DiffuseStyleGesture+, our solution for the Generation and Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023, which aims to foster the development of realistic, automated systems for generating conversational gestures. Participants are provided with a pre-processed dataset and their systems are evaluated through crowdsourced scoring. Our proposed model, DiffuseStyleGesture+, leverages a diffusion model to generate gestures automatically. It incorporates a variety of modalities, including audio, text, speaker ID, and seed gestures. These diverse modalities are mapped to a hidden space and processed by a modified diffusion model to produce the corresponding gesture for a given speech input. Upon evaluation, the DiffuseStyleGesture+ demonstrated performance on par with the top-tier models in the challenge, showing no significant differences with those models in human-likeness, appropriateness for the interlocutor, and achieving competitive performance with the best model on appropriateness for agent speech. This indicates that our model is competitive and effective in generating realistic and appropriate gestures for given speech. The code, pre-trained models, and demos are available at https://github.com/YoungSeng/DiffuseStyleGesture/tree/DiffuseStyleGesturePlus/BEAT-TWH-main.
Human-Computer Interaction, Artificial Intelligence, Multimedia
### What problem does this paper attempt to solve?

This paper addresses the problem of **generating natural conversational gestures from speech**, in the context of the **Generation and Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023**. Specifically, the authors propose a model named **DiffuseStyleGesture+** to generate realistic and appropriate conversational gestures that match the speech input.

#### Main problems and challenges:

1. **Generating realistic conversational gestures**: Existing methods struggle to generate high-quality gestures, especially when the gestures must stay consistent with the speech while remaining diverse.
2. **Integration of multi-modal inputs**: Generating more natural gestures requires integrating information from multiple modalities, such as audio, text, and speaker identity. Fusing information across these modalities is a complex problem.
3. **Diversity and controllability of gestures**: The generated gestures must be not only high-quality but also diverse and controllable through the input conditions.
4. **Evaluating the generated gestures**: Objectively evaluating the quality of generated gestures is itself a hard problem. In the challenge, systems are evaluated through crowdsourced scoring.

### Solutions:

The authors propose the **DiffuseStyleGesture+** model, which is based on a diffusion model and handles multi-modal inputs to generate high-quality gestures. Specifically:

- **Multi-modal inputs**: The model combines multiple input modalities: audio, text, speaker identity, and seed gestures. These are mapped to a hidden space before being processed by the diffusion model.
- **Application of the diffusion model**: Using a diffusion model to generate gestures allows the output to be diverse and controllable while maintaining high quality.
- **Local attention mechanism**: A cross-local attention mechanism is introduced to better handle the relationships between different modalities and to keep the generated gestures aligned with the speech.

### Summary:

The main goal of this paper is to develop a gesture generation system that produces high-quality, diverse, and speech-aligned gestures in conversational settings. By combining a diffusion model with multi-modal inputs, the authors address shortcomings of existing gesture generation methods. In the GENEA Challenge 2023 evaluation, the model showed no significant differences from the top-tier models in human-likeness and appropriateness for the interlocutor, and was competitive with the best model on appropriateness for agent speech.
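To make the local attention idea mentioned above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that "local" means each gesture frame attends only to conditioning frames within a fixed window around its own position. The function names, the `window` parameter, and the toy feature vectors are hypothetical and chosen only for illustration.

```python
import math


def local_attention_mask(num_queries, num_keys, window):
    """Boolean mask: entry [i][j] is True if query frame i may attend to key frame j.

    Assumption for this sketch: queries and keys are aligned frame by frame,
    and a frame only sees keys at most `window` positions away.
    """
    return [[abs(i - j) <= window for j in range(num_keys)]
            for i in range(num_queries)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get zero weight."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention(queries, keys, values, window):
    """Scaled dot-product attention restricted to a local window.

    `queries` stand in for per-frame gesture features; `keys` and `values`
    stand in for per-frame conditioning features (for example, audio).
    """
    dim = len(queries[0])
    mask = local_attention_mask(len(queries), len(keys), window)
    outputs = []
    for i, query in enumerate(queries):
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        weights = masked_softmax(scores, mask[i])
        outputs.append([sum(w * value[d] for w, value in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


if __name__ == "__main__":
    # Toy example: 4 frames with 2-dimensional features, window of 1 frame.
    gesture_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    audio_feats = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5], [0.8, 0.7]]
    for row in local_attention(gesture_feats, audio_feats, audio_feats, window=1):
        print([round(x, 3) for x in row])
```

With `window=0` each frame attends only to the conditioning frame at its own position, so the output simply copies `values`. Widening the window lets each frame blend in nearby context, which is one way such a mechanism could keep gestures temporally aligned with the speech.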