Large Body Language Models

Saif Punjwani, Larry Heck
2024-10-22
Abstract: As virtual agents become increasingly prevalent in human-computer interaction, generating realistic and contextually appropriate gestures in real-time remains a significant challenge. While neural rendering techniques have made substantial progress with static scripts, their applicability to human-computer interactions remains limited. To address this, we introduce Large Body Language Models (LBLMs) and present LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video). LBLM-AVA incorporates several key components enhancing its gesture generation capabilities, such as multimodal-to-pose embeddings, enhanced sequence-to-sequence mapping with redefined attention mechanisms, a temporal smoothing module for gesture sequence coherence, and an attention-based refinement module for enhanced realism. The model is trained on our large-scale proprietary open-source dataset Allo-AVA. LBLM-AVA achieves state-of-the-art performance in generating lifelike and contextually appropriate gestures with a 30% reduction in Fréchet Gesture Distance (FGD), and a 25% improvement in Fréchet Inception Distance compared to existing approaches.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses the challenge of generating realistic and contextually appropriate body language (gestures) in real-time human-computer interaction. Although neural rendering techniques have made significant progress with static scripts, their applicability to live human-computer interaction remains limited. Specifically, existing methods have the following shortcomings:

1. **Static Movements**: Gestures generated by existing neural rendering techniques often appear mechanical and lack natural fluidity.
2. **Limited Multimodal Interaction Capability**: Existing methods struggle to handle dynamic dialogue scenarios involving multiple input modalities such as text, audio, and video.
3. **Long-term Coherence and Adaptability**: Traditional methods find it difficult to maintain coherence over long gesture sequences and to adapt to a continuously changing dialogue context.

To address these issues, the authors introduce Large Body Language Models (LBLMs) and propose a new architecture named LBLM-AVA. This model combines a Transformer-XL language model with a parallelized diffusion model to generate realistic and contextually appropriate gestures in real-time from multimodal inputs (text, audio, and video).

### Main Contributions

1. **Novel Architecture**: LBLM-AVA combines Transformer-XL with a parallelized diffusion model to handle multimodal inputs and generate high-quality gestures.
2. **Multimodal-to-Pose Embedding**: A multimodal-to-pose embedding module maps features from the different modalities into the pose space, enhancing the model's generation capability.
3. **Temporal Smoothing Module**: A temporal smoothing module keeps the generated gesture sequences temporally coherent.
4. **Attention Mechanism**: The attention mechanism is redefined to enhance sequence-to-sequence mapping.
5. **Adversarial Training**: Adversarial training is employed to further improve the realism and diversity of the generated gestures.
6. **Large-scale Dataset**: A large-scale multimodal dataset named Allo-AVA is developed, containing high-quality video, audio, and text data from various sources as training material.

### Experimental Results

Experimental evaluations indicate that LBLM-AVA achieves state-of-the-art performance in generating realistic and contextually appropriate gestures. Compared to existing methods, the model reduces Fréchet Gesture Distance (FGD) by 30% and Fréchet Inception Distance (FID) by 25%, and it also performs well on the diversity of generated gestures.

### Conclusion

By introducing the LBLM-AVA model and the Allo-AVA dataset, this paper advances research on real-time multimodal gesture generation. These results improve the naturalness and appeal of virtual agents in human-computer interaction and open up new possibilities for virtual assistants, social robots, remote presentation systems, and educational technologies. The authors also point out open challenges in controlling the style and personality of the generated gestures and suggest these as directions for future research.
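As background for the FGD and FID figures reported above: both are Fréchet distances between two Gaussian fits, one to features of real samples and one to features of generated samples, so lower values are better. The sketch below is an illustration only, not the paper's evaluation code. It assumes diagonal covariances so that it runs on the standard library alone, whereas FID and FGD as normally computed use full covariance matrices and a matrix square root over features from a pretrained encoder. The function names and the numbers in the usage example are hypothetical.

```python
import math
from statistics import fmean, pvariance


def gaussian_stats(features):
    """Per-dimension mean and standard deviation of a list of feature vectors."""
    dims = list(zip(*features))  # transpose: one tuple per feature dimension
    means = [fmean(d) for d in dims]
    stds = [math.sqrt(pvariance(d)) for d in dims]
    return means, stds


def frechet_distance_diag(real_features, generated_features):
    """Squared Frechet distance between two Gaussians with diagonal covariance.

    Simplifying assumption: feature dimensions are treated as independent,
    so the general trace term Tr(S1 + S2 - 2 (S1 S2)^(1/2)) reduces to the
    sum over dimensions of (sigma1 - sigma2)^2.
    """
    mu_r, sd_r = gaussian_stats(real_features)
    mu_g, sd_g = gaussian_stats(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_g))
    return mean_term + cov_term


def relative_reduction(baseline_score, model_score):
    """Fractional reduction of a lower-is-better score relative to a baseline."""
    return (baseline_score - model_score) / baseline_score


if __name__ == "__main__":
    # Hypothetical 2-D features, for illustration only.
    real = [[0.0, 0.0], [2.0, 2.0]]
    generated = [[1.0, 1.0], [3.0, 3.0]]
    print(frechet_distance_diag(real, generated))
    # A hypothetical baseline score of 10.0 against a model score of 7.0
    # corresponds to a 30% reduction.
    print(relative_reduction(10.0, 7.0))
```

Read this way, a "30% reduction in FGD" means the model's distance is 0.7 times that of the comparison method on the same feature extractor. Absolute values are not comparable across different extractors.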