Abstract: In this paper, we introduce DiffuseStyleGesture+, our solution for the Generation and Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023, which aims to foster the development of realistic, automated systems for generating conversational gestures. Participants are provided with a pre-processed dataset, and their systems are evaluated through crowdsourced scoring. Our proposed model, DiffuseStyleGesture+, leverages a diffusion model to generate gestures automatically. It incorporates a variety of modalities, including audio, text, speaker ID, and seed gestures. These diverse modalities are mapped to a hidden space and processed by a modified diffusion model to produce the corresponding gesture for a given speech input. Upon evaluation, DiffuseStyleGesture+ demonstrated performance on par with the top-tier models in the challenge, showing no significant differences from those models in human-likeness and appropriateness for the interlocutor, and achieving competitive performance with the best model on appropriateness for agent speech. This indicates that our model is competitive and effective in generating realistic and appropriate gestures for given speech. The code, pre-trained models, and demos are available at <a class="link-external link-https" href="https://github.com/YoungSeng/DiffuseStyleGesture/tree/DiffuseStyleGesturePlus/BEAT-TWH-main" rel="external noopener nofollow">this https URL</a>.
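As a rough illustration of what "mapped to a hidden space" can look like, the sketch below projects each modality to a shared hidden size and builds one condition vector per frame. It is not the authors' implementation: the abstract does not say how the modalities are combined, so the summation used here, the function names (`random_matrix`, `project`, `fuse_conditions`), all dimensions, and the random weights standing in for learned ones are assumptions made for the example.

```python
import random


def random_matrix(in_dim, out_dim, rng):
    """Return an (in_dim x out_dim) matrix of small random weights (stand-in for learned ones)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vec, weight):
    """Multiply a length-in_dim vector by an (in_dim x out_dim) matrix."""
    out_dim = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(out_dim)]


def fuse_conditions(audio, text, speaker_id, seed_gesture, params):
    """Map every modality to the hidden space and return one condition vector per frame."""
    # Clip-level conditions: a speaker embedding lookup and a projected seed gesture.
    speaker_vec = params["speaker_table"][speaker_id]
    seed_vec = project(seed_gesture, params["w_seed"])
    clip_vec = [s + g for s, g in zip(speaker_vec, seed_vec)]

    fused = []
    # Audio and text are assumed to be already aligned frame by frame.
    for audio_frame, text_frame in zip(audio, text):
        a = project(audio_frame, params["w_audio"])
        t = project(text_frame, params["w_text"])
        # Assumed fusion rule: add frame-level features to the broadcast clip-level vector.
        fused.append([x + y + z for x, y, z in zip(a, t, clip_vec)])
    return fused


# Hypothetical sizes, chosen only to keep the example small.
HIDDEN, N_FRAMES = 8, 5
AUDIO_DIM, TEXT_DIM, SEED_DIM, N_SPEAKERS = 6, 4, 12, 3

rng = random.Random(0)
params = {
    "w_audio": random_matrix(AUDIO_DIM, HIDDEN, rng),
    "w_text": random_matrix(TEXT_DIM, HIDDEN, rng),
    "w_seed": random_matrix(SEED_DIM, HIDDEN, rng),
    "speaker_table": random_matrix(N_SPEAKERS, HIDDEN, rng),  # one row per speaker
}
audio = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(N_FRAMES)]
text = [[rng.random() for _ in range(TEXT_DIM)] for _ in range(N_FRAMES)]
seed_gesture = [rng.random() for _ in range(SEED_DIM)]

cond = fuse_conditions(audio, text, 1, seed_gesture, params)
print(len(cond), len(cond[0]))  # 5 8
```

In a diffusion setup, a per-frame condition sequence like `cond` would be passed to the denoising network alongside the noisy gesture and the diffusion step. That wiring is omitted here.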


### What problem does this paper attempt to solve?

This paper addresses **dialogue-based natural gesture generation**, specifically in the setting of the **Generation and Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023**. The authors propose a model named **DiffuseStyleGesture+** that generates realistic and appropriate conversational gestures matching the speech input.

#### Main problems and challenges:

1. **Generating realistic dialogue gestures**: Existing methods struggle to generate high-quality gestures, especially to keep them consistent with the speech while preserving gesture diversity.
2. **Integration of multi-modal inputs**: Generating more natural gestures requires integrating information from multiple modalities, such as audio, text, and speaker identity. Fusing information across these modalities is a complex problem.
3. **Diversity and controllability of gestures**: The generated gestures must be not only high-quality but also diverse and controllable according to different input conditions.
4. **Evaluating the generated gestures**: Objectively evaluating the quality of generated gestures is also an important problem. In the challenge, systems are evaluated through crowdsourced scoring.

### Solutions:

The authors propose the **DiffuseStyleGesture+** model, which is based on a diffusion model and handles multi-modal inputs to generate high-quality gestures. Specifically:

- **Multi-modal inputs**: The model combines multiple input modalities: audio, text, speaker identity, and seed gestures.
- **Application of the diffusion model**: Using a diffusion model allows the system to generate diverse and controllable gestures while maintaining high quality.
- **Local attention mechanism**: A cross-local attention mechanism is introduced to better handle the relationships between different modalities and keep the generated gestures aligned with the speech.

### Summary:

The main goal of this paper is to develop a gesture generation system that produces high-quality, diverse, and speech-aligned gestures, especially in dialogue scenarios. By combining a diffusion model with multi-modal inputs, the authors address shortcomings of existing gesture generation methods and demonstrate the effectiveness of their model in the GENEA Challenge 2023.
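The local attention idea mentioned in the solutions can be illustrated with a generic windowed cross-attention: each gesture frame attends only to speech frames within a fixed distance of its own position. This is a minimal sketch of that general technique, not the paper's cross-local attention layer. The function name `local_attention`, the `window` parameter, the single-head formulation, and the toy data are assumptions made for the example.

```python
import math
import random


def local_attention(queries, keys, values, window):
    """Scaled dot-product attention in which frame i only sees frames i-window..i+window."""
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        lo = max(0, i - window)
        hi = min(len(keys), i + window + 1)
        # Attention scores for the frames inside the local window only.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(lo, hi)
        ]
        # Numerically stable softmax over the window.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted sum of the value vectors inside the window.
        out = [0.0] * len(values[0])
        for weight, j in zip(weights, range(lo, hi)):
            for d, v in enumerate(values[j]):
                out[d] += (weight / total) * v
        outputs.append(out)
    return outputs


# Toy data: queries stand in for gesture frames, keys/values for speech features.
rng = random.Random(0)
N_FRAMES, DIM = 6, 4
gesture = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_FRAMES)]
speech = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_FRAMES)]

attended = local_attention(gesture, speech, speech, window=1)
print(len(attended), len(attended[0]))  # 6 4
```

Restricting each frame to a short window keeps the output at frame `i` independent of speech far from `i`, which is one simple way to encourage temporal alignment between gesture and speech.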

The DiffuseStyleGesture+ entry to the GENEA Challenge 2023
