Abstract: Speech-driven gesture generation is an emerging domain within virtual human creation, where current methods predominantly utilize Transformer-based architectures that necessitate extensive memory and are characterized by slow inference speeds. In response to these limitations, we propose DiM-Gestures, a novel end-to-end generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio, employing Mamba-based architectures. This model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, leveraging a Mamba framework and a WavLM pre-trained model, autonomously derives implicit, continuous fuzzy features, which are then unified into a singular latent feature. This feature is processed by the AdaLN Mamba-2, which implements a uniform conditional mechanism across all tokens to robustly model the interplay between the fuzzy features and the resultant gesture sequence. This innovative approach guarantees high fidelity in gesture-speech synchronization while maintaining the naturalness of the gestures. Employing a diffusion model for training and inference, our framework has undergone extensive subjective and objective evaluations on the ZEGGS and BEAT datasets. These assessments substantiate our model's enhanced performance relative to contemporary state-of-the-art methods, demonstrating competitive outcomes with the DiTs architecture (Persona-Gestors) while optimizing memory usage and accelerating inference speed.

What problem does this paper attempt to address?

The paper addresses co-speech gesture generation for virtual human creation, focusing on two limitations of existing Transformer-based methods: high memory consumption and slow inference. The research team proposes a novel end-to-end generative model named DiM-Gesture, which generates highly personalized 3D full-body co-speech gestures from raw speech audio alone.

Specifically, DiM-Gesture combines a fuzzy feature extractor based on the Mamba architecture with a non-autoregressive adaptive layer normalization (AdaLN) Mamba-2 diffusion architecture. The fuzzy feature extractor uses the Mamba framework and the pre-trained WavLM model to automatically derive implicit, continuous fuzzy features, which are then merged into a single latent feature representation. This representation is processed by the AdaLN Mamba-2 module, which applies a uniform conditioning mechanism across all tokens to robustly model the interaction between the fuzzy features and the resulting gesture sequence.

The key contributions of the paper include:

1. Proposing a Mamba-based fuzzy feature inference strategy that can synthesize a broader range of personalized gestures from speech audio without style labels or any additional input.
2. Integrating the AdaLN Mamba-2 architecture into the diffusion model, improving the modeling of the complex relationship between speech and gestures. The results indicate that the Mamba architecture can rival traditional Transformer architectures in gesture generation and achieve a good balance between naturalness and synchronization.
3. Conducting extensive subjective and objective evaluations, which confirm the superior performance of the model compared to current state-of-the-art methods, particularly in generating believable, speech-matched, and personalized gestures.

In summary, the DiM-Gesture model aims to overcome the limitations of existing methods and provide an efficient and high-quality solution for co-speech gesture generation.
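
The uniform conditioning mechanism mentioned above can be illustrated with a small sketch. It is not the paper's implementation: it follows the common DiT-style AdaLN form, in which a scale and a shift are regressed from a conditioning vector and applied identically to every normalized token. The function names (`layer_norm`, `linear`, `adaln`), the weight layout, and the toy dimensions are all assumptions made for illustration, and the Mamba-2 sequence mixing that would follow this step is omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, vec):
    """Plain dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def adaln(tokens, cond, w_scale, b_scale, w_shift, b_shift):
    """Apply the same condition-derived scale and shift to every token.

    `cond` stands in for the single latent feature produced by the fuzzy
    feature extractor (an assumption about how the pieces connect).
    """
    scale = linear(w_scale, b_scale, cond)  # one value per channel
    shift = linear(w_shift, b_shift, cond)  # one value per channel
    return [
        [n * (1.0 + s) + t for n, s, t in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]


if __name__ == "__main__":
    # Two toy gesture tokens with 4 channels, and a 2-dimensional condition.
    tokens = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
    cond = [0.5, -0.5]
    w_scale = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0]]
    w_shift = [[0.2, 0.0], [0.0, 0.2], [0.0, 0.0], [0.2, 0.2]]
    zero_bias = [0.0] * 4
    for row in adaln(tokens, cond, w_scale, zero_bias, w_shift, zero_bias):
        print([round(v, 3) for v in row])
```

Because the scale and shift depend only on the condition and not on the token position, every frame of the gesture sequence is modulated in the same way, which is one plausible reading of a conditioning mechanism that is uniform across all tokens. With all-zero weights and biases the function reduces to plain layer normalization.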

DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework

DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2

MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion

Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference

DiT-Gesture: A Speech-Only Approach to Stylized Gesture Generation

MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Masked Audio Gesture Modeling

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model

UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons

EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation

C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model

DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model

Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation

ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance