Abstract: As virtual agents become increasingly prevalent in human-computer interaction, generating realistic and contextually appropriate gestures in real-time remains a significant challenge. While neural rendering techniques have made substantial progress with static scripts, their applicability to human-computer interactions remains limited. To address this, we introduce Large Body Language Models (LBLMs) and present LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video). LBLM-AVA incorporates several key components enhancing its gesture generation capabilities, such as multimodal-to-pose embeddings, enhanced sequence-to-sequence mapping with redefined attention mechanisms, a temporal smoothing module for gesture sequence coherence, and an attention-based refinement module for enhanced realism. The model is trained on our large-scale proprietary open-source dataset Allo-AVA. LBLM-AVA achieves state-of-the-art performance in generating lifelike and contextually appropriate gestures with a 30% reduction in Fréchet Gesture Distance (FGD), and a 25% improvement in Fréchet Inception Distance compared to existing approaches.
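For readers unfamiliar with the metrics named in the abstract: Fréchet-style distances such as FGD and FID are commonly computed by fitting a Gaussian to feature embeddings of real samples and another to embeddings of generated samples, then taking the Fréchet distance between the two Gaussians. The sketch below illustrates that general formula only. The function name `frechet_distance` and the random stand-in features are assumptions made for this example; the feature extractor and evaluation code used by the authors are not described here.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Both inputs are arrays of shape (n_samples, dim) with dim >= 2.
    Hypothetical helper for illustration, not the authors' evaluation code.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((cov_r cov_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative in
    # exact arithmetic; clip to guard against small numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Stand-in features (random numbers, not real gesture embeddings).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
generated = rng.normal(loc=0.3, size=(500, 8))
print(frechet_distance(real, generated))
```

Lower values mean the generated feature distribution sits closer to the real one, so a "30% reduction" in FGD corresponds to a score of 0.7 times the baseline score.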

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the challenge of generating realistic and contextually appropriate body language (gestures) in real-time human-computer interaction. Although neural rendering techniques have made significant progress with static scripts, their application to live human-computer interaction remains limited. Specifically, existing methods have the following shortcomings:

1. **Static Movements**: Gestures generated by existing neural rendering techniques often appear mechanical and lack natural fluidity.
2. **Limited Multimodal Interaction Capability**: Existing methods struggle to handle dynamic dialogue scenarios involving multiple input modes such as text, audio, and video.
3. **Long-term Coherence and Adaptability**: Traditional methods find it difficult to maintain coherence when generating long gesture sequences and to adapt to a continuously changing dialogue context.

To address these issues, the authors introduce Large Body Language Models (LBLMs) and propose a new architecture named LBLM-AVA. This model combines Transformer-XL with a parallelized diffusion model to generate realistic and contextually appropriate gestures in real-time from multimodal inputs (text, audio, and video).

### Main Contributions

1. **Novel Architecture**: LBLM-AVA combines Transformer-XL with a parallelized diffusion model to handle multimodal inputs and generate high-quality gestures.
2. **Multimodal-to-Pose Embedding**: A multimodal-to-pose embedding module maps features from the different modalities into the pose space, enhancing the model's generation capability.
3. **Temporal Smoothing Module**: A temporal smoothing module keeps the generated gesture sequences temporally coherent.
4. **Attention Mechanism**: The attention mechanism is redefined to enhance the sequence-to-sequence mapping capability.
5. **Adversarial Training**: Adversarial training methods are employed to further improve the realism and diversity of the generated gestures.
6. **Large-scale Dataset**: A large-scale multimodal dataset named Allo-AVA is developed, containing high-quality video, audio, and text data from various sources, providing rich resources for model training.

### Experimental Results

Experimental evaluations indicate that LBLM-AVA achieves state-of-the-art performance in generating realistic and contextually appropriate gestures. Compared to existing methods, the model reduces the Fréchet Gesture Distance (FGD) by 30% and the Fréchet Inception Distance (FID) by 25%, and it also excels in the diversity of generated gestures.

### Conclusion

By introducing the LBLM-AVA model and the Allo-AVA dataset, this paper advances research on real-time multimodal gesture generation. These results enhance the naturalness and appeal of virtual agents in human-computer interaction and open up new possibilities for virtual assistants, social robots, remote presentation systems, and educational technologies. However, the authors also point out potential challenges in controlling the style and personality of the model's output and suggest these as directions for future research.

Large Body Language Models
