LLM Gesticulator: Leveraging Large Language Models for Scalable and Controllable Co-Speech Gesture Synthesis

Haozhou Pang, Tianwei Ding, Lanshan He, Ming Tao, Lu Zhang, Qi Gan
2024-10-22
Abstract: In this work, we present LLM Gesticulator, an LLM-based audio-driven co-speech gesture generation framework that synthesizes full-body animations that are rhythmically aligned with the input audio while exhibiting natural movements and editability. Compared to previous work, our model demonstrates substantial scalability. As the size of the backbone LLM increases, our framework shows proportional improvements in evaluation metrics (a.k.a. scaling law). Our method also exhibits strong controllability, where the content and style of the generated gestures can be controlled by text prompts. To the best of our knowledge, LLM Gesticulator is the first work that uses LLMs for the co-speech gesture generation task. Evaluation with existing objective metrics and user studies indicates that our framework outperforms prior works.
Graphics, Artificial Intelligence, Computation and Language, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to generate natural, controllable full-body co-speech gestures that are synchronized with speech. Specifically, the authors propose LLM Gesticulator, a framework built on large language models (LLMs) that generates high-quality full-body animations from input audio and text prompts. The animations are rhythmically aligned with the input audio, exhibit natural movements, and are editable. Compared with previous work, the framework shows significant scalability: as the size of the backbone LLM increases, the evaluation metrics improve correspondingly (the so-called "scaling law"). The method is also strongly controllable, in that the content and style of the generated gestures can be controlled through text prompts. According to the authors, this is the first work to use LLMs for the co-speech gesture generation task.
The main contributions of the paper include:
- Proposing an LLM-based framework for generating full-body (body + hand) gestures, which outperforms previous work on existing evaluation metrics and in user studies.
- Proposing a training scheme that supports controllable gesture generation based on text prompts.
- Proposing a new data-augmentation paradigm that uses a rendering engine and a VLLM to label the action descriptions in the BEAT dataset, and providing these labels to the community for future research.