FastTalker: Jointly Generating Speech and Conversational Gestures from Text

Zixin Guo, Jian Zhang
2024-09-25
Abstract: Generating 3D human gestures and speech from a text script is critical for creating realistic talking avatars. One solution is to leverage separate pipelines for text-to-speech (TTS) and speech-to-gesture (STG), but this approach suffers from poor alignment of speech and gestures and slow inference times. In this paper, we introduce FastTalker, an efficient and effective framework that simultaneously generates high-quality speech audio and 3D human gestures at high inference speeds. Our key insight is reusing the intermediate features from speech synthesis for gesture generation, as these features contain more precise rhythmic information than features re-extracted from generated speech. Specifically, 1) we propose an end-to-end framework that concurrently generates speech waveforms and full-body gestures, using intermediate speech features such as pitch, onset, energy, and duration directly for gesture decoding; 2) we redesign the causal network architecture to eliminate dependencies on future inputs for real applications; 3) we employ Reinforcement Learning-based Neural Architecture Search (NAS) to enhance both performance and inference speed by optimizing our network architecture. Experimental results on the BEAT2 dataset demonstrate that FastTalker achieves state-of-the-art performance in both speech synthesis and gesture generation, processing speech and gestures in 0.17 seconds per second on an NVIDIA 3090.
Multimedia, Sound, Audio and Speech Processing
### What problem does this paper attempt to solve?

This paper addresses the problem of generating realistic 3D human gestures and speech from a text script. The authors propose FastTalker, a framework that generates high-quality speech audio and 3D human gestures simultaneously while keeping inference fast.

#### Main problems

1. **Deficiencies of existing methods**:
   - **Problems with separate pipelines**: Existing methods usually run text-to-speech (TTS) and speech-to-gesture (STG) as independent pipelines, which leads to poor alignment between speech and gestures and to slow inference.
   - **Lack of joint generation models**: Currently, there is no open-source work that generates speech and gestures simultaneously from a text script.
2. **Improving efficiency and effectiveness**:
   - **Alignment issues**: Gestures should be generated from the intermediate features of speech synthesis (such as pitch, onset, energy, and duration), because these carry more precise rhythmic information than features re-extracted from the generated speech.
   - **Requirements for real-time applications**: Real-time use requires a causal network architecture that does not depend on future inputs, together with an architecture optimized for inference speed.

#### Solutions

- **FastTalker framework**:
  - **End-to-end framework**: Generates speech waveforms and full-body gestures concurrently, feeding the intermediate speech-synthesis features directly into gesture decoding.
  - **Causal network architecture**: Redesigns the network so that it has no dependence on future inputs, which supports real-time applications.
  - **Reinforcement learning neural architecture search (NAS)**: Improves performance and inference speed by optimizing architecture hyperparameters (such as the number of encoder and decoder layers, convolution kernel size, and channel dimension).
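The causal-architecture point above can be illustrated with a minimal sketch. It is not FastTalker's network: the source does not specify layer types, so the example assumes a plain 1D convolution over per-frame features, and the names `causal_conv1d` and `centered_filter` are hypothetical. It shows that left-padding makes each output frame depend only on the current and past frames, whereas a centered filter also reads the next frame.

```python
from typing import List, Sequence


def causal_conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Causal 1D convolution (illustrative, not the paper's layer).

    kernel[0] weights the current frame and kernel[j] weights the frame
    j steps in the past, so output[t] never reads signal[t + 1] or later.
    """
    k = len(kernel)
    # Pad only on the left: the earliest outputs see zeros instead of the future.
    padded = [0.0] * (k - 1) + list(signal)
    out = []
    for t in range(len(signal)):
        # padded[t + k - 1] is signal[t]; index backwards through the past.
        out.append(sum(kernel[j] * padded[t + k - 1 - j] for j in range(k)))
    return out


def centered_filter(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Non-causal comparison: the window is centered, so it looks ahead."""
    k = len(kernel)
    half = k // 2
    # Pad on both sides so the output has the same length as the input.
    padded = [0.0] * half + list(signal) + [0.0] * (k - 1 - half)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


if __name__ == "__main__":
    kernel = [0.5, 0.3, 0.2]
    frames = [1.0, 2.0, 3.0, 4.0]
    edited = frames[:-1] + [100.0]  # change only the last (future) frame

    # Compare the outputs for the first three frames before and after the edit.
    print(causal_conv1d(frames, kernel)[:3] == causal_conv1d(edited, kernel)[:3])
    print(centered_filter(frames, kernel)[:3] == centered_filter(edited, kernel)[:3])
```

Changing the last frame leaves the first three causal outputs untouched but alters the third centered output. This is the property that lets a causal model emit each frame as soon as its input arrives, without waiting for later input.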
#### Experimental results

Experimental results on the BEAT2 dataset show that FastTalker achieves state-of-the-art performance in both speech synthesis and gesture generation, and that it needs 0.17 seconds of processing for each second of generated speech and gestures on an NVIDIA 3090 GPU. FastTalker thus improves both generation quality and efficiency, opening new possibilities for creating realistic virtual characters and animations.
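The speed figure from the abstract can be read as a real-time factor: processing time divided by the duration of the generated output. The arithmetic below is a small sketch under that reading. The term "real-time factor" and the helper names are assumptions for illustration, and the source does not say what the 0.17-second measurement includes (for example, model loading or batching).

```python
def real_time_factor(compute_seconds: float, output_seconds: float) -> float:
    """Processing time per second of generated output (lower is faster)."""
    if output_seconds <= 0:
        raise ValueError("output_seconds must be positive")
    return compute_seconds / output_seconds


def generation_time(output_seconds: float, rtf: float) -> float:
    """Estimated processing time for a clip, assuming the factor stays constant."""
    return output_seconds * rtf


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of output are produced per second of processing."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    reported_rtf = 0.17  # figure from the abstract, measured on an NVIDIA 3090

    # A factor below 1.0 means generation runs faster than playback.
    print("faster than real time:", reported_rtf < 1.0)
    print("time for a 10 s clip:", round(generation_time(10.0, reported_rtf), 2))
    print("speedup:", round(speedup_over_real_time(reported_rtf), 2))
```

Under this reading, a 10-second clip takes about 1.7 seconds to generate, which is roughly 5.9 times faster than playback.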