Abstract: Generating 3D human gestures and speech from a text script is critical for creating realistic talking avatars. One solution is to leverage separate pipelines for text-to-speech (TTS) and speech-to-gesture (STG), but this approach suffers from poor alignment of speech and gestures and slow inference times. In this paper, we introduce FastTalker, an efficient and effective framework that simultaneously generates high-quality speech audio and 3D human gestures at high inference speeds. Our key insight is reusing the intermediate features from speech synthesis for gesture generation, as these features contain more precise rhythmic information than features re-extracted from generated speech. Specifically, 1) we propose an end-to-end framework that concurrently generates speech waveforms and full-body gestures, using intermediate speech features such as pitch, onset, energy, and duration directly for gesture decoding; 2) we redesign the causal network architecture to eliminate dependencies on future inputs for real applications; 3) we employ Reinforcement Learning-based Neural Architecture Search (NAS) to enhance both performance and inference speed by optimizing our network architecture. Experimental results on the BEAT2 dataset demonstrate that FastTalker achieves state-of-the-art performance in both speech synthesis and gesture generation, processing speech and gestures in 0.17 seconds per second on an NVIDIA 3090.
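
One way to picture the feature reuse described in the abstract: a speech synthesizer that predicts per-phoneme duration, pitch, and energy can expand those values to frame level and hand the same frame-aligned sequence to the gesture decoder. The sketch below shows only that expansion step (often called length regulation in FastSpeech-style models). The function name `expand_to_frames`, the feature layout, and all values are illustrative assumptions, not FastTalker's actual implementation.

```python
def expand_to_frames(phoneme_features, durations):
    """Repeat each phoneme-level feature vector for its number of frames.

    Hypothetical helper for illustration only.
    """
    if len(phoneme_features) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for feat, n_frames in zip(phoneme_features, durations):
        frames.extend([feat] * n_frames)
    return frames


# (pitch in Hz, energy) per phoneme: made-up values
features = [(110.0, 0.2), (180.0, 0.9), (150.0, 0.5)]
durations = [2, 3, 1]  # frames per phoneme, made-up values

frame_features = expand_to_frames(features, durations)
# One feature vector per output frame, already aligned with the speech timeline
assert len(frame_features) == sum(durations)
```

Because the gesture decoder would read the same frame-level sequence that drives the audio, no second feature-extraction pass over the generated waveform is needed in this picture.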


### What problem does this paper attempt to solve?

This paper addresses the problem of generating realistic 3D human gestures and speech from a text script. The authors propose FastTalker, a framework that generates high-quality speech audio and 3D human gestures simultaneously and at high inference speed.

#### Main problems

1. **Deficiencies of existing methods**:
   - **Problems with separate pipelines**: Existing methods usually chain independent text-to-speech (TTS) and speech-to-gesture (STG) pipelines, which leads to poor alignment between speech and gestures and slow inference.
   - **Lack of joint generation models**: Currently, there is no open-source work that generates speech and gestures simultaneously from a text script.
2. **Improving efficiency and effectiveness**:
   - **Alignment issues**: Gestures are generated by reusing intermediate features of speech synthesis (such as pitch, onset, energy, and duration). These features carry more precise rhythmic information than features re-extracted from the generated speech.
   - **Requirements for real-time applications**: Real-time use requires a causal network architecture that does not depend on future inputs, together with an architecture optimized for inference speed.

#### Solutions

- **FastTalker framework**:
  - **End-to-end framework**: Generates speech waveforms and full-body gestures concurrently, feeding the intermediate speech-synthesis features directly into gesture decoding.
  - **Causal network architecture**: Redesigns the network so that it has no dependence on future inputs, which supports real-time applications.
  - **Reinforcement Learning-based Neural Architecture Search (NAS)**: Improves performance and inference speed by optimizing architecture hyperparameters (such as the number of encoder and decoder layers, convolution kernel size, and channel dimension).

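
The causal-architecture point above can be illustrated with a minimal sketch of a causal 1D convolution: the input is padded on the left only, so each output frame depends on the current and past frames and never on future ones. The helper `causal_conv1d` is a hypothetical name used for illustration; it is not taken from FastTalker's code.

```python
def causal_conv1d(x, kernel):
    """Causal 1D convolution: y[t] = sum_j kernel[j] * x[t - j].

    Hypothetical helper for illustration. Frames before the start of the
    sequence are treated as zeros, and y[t] never reads x[t + 1] or later.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # left padding only
    out = []
    for t in range(len(x)):
        window = padded[t:t + k]  # holds x[t-k+1] .. x[t]
        out.append(sum(w * v for w, v in zip(reversed(kernel), window)))
    return out


x = [1.0, 2.0, 3.0, 4.0]
y = causal_conv1d(x, [0.5, 0.5])

# Changing a future frame leaves all earlier outputs untouched.
x_changed = [1.0, 2.0, 3.0, 100.0]
assert causal_conv1d(x_changed, [0.5, 0.5])[:3] == y[:3]
```

A symmetric ("same") padding would instead let `y[t]` see `x[t + 1]`, which forces the model to wait for future input before emitting a frame.
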
#### Experimental results

Experimental results on the BEAT2 dataset show that FastTalker achieves state-of-the-art performance in both speech synthesis and gesture generation, and that it needs 0.17 seconds of processing per second of generated speech and gestures on an NVIDIA 3090 GPU. With these improvements, FastTalker raises both generation quality and efficiency, which makes it a practical basis for realistic talking avatars and animations.
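
Reading the reported figure as a real-time factor (0.17 seconds of compute per second of output, which is an interpretation of the abstract's wording), a short worked calculation shows what it implies. The function names below are illustrative.

```python
def processing_time(output_seconds, rtf=0.17):
    """Seconds of compute needed to generate `output_seconds` of output."""
    return output_seconds * rtf


def speedup_over_real_time(rtf=0.17):
    """How many times faster than real time generation runs."""
    return 1.0 / rtf


# A 10-second clip would take about 1.7 seconds to generate,
# i.e. roughly 5.9 times faster than real time.
clip_time = processing_time(10.0)
speedup = speedup_over_real_time()
```

Any real-time factor below 1.0 means output is produced faster than it is played back, which is the precondition for streaming use.
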

FastTalker: Jointly Generating Speech and Conversational Gestures from Text
