LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning

Zhe Li, Weihao Yuan, Yisheng He, Lingteng Qiu, Shenhao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong, Zilong Dong, Laurence T. Yang
2024-10-10
Abstract: Language plays a vital role in the realm of human motion. Existing methods have largely depended on CLIP text embeddings for motion generation, yet they fall short in effectively aligning language and motion due to CLIP's pretraining on static image-text pairs. This work introduces LaMP, a novel Language-Motion Pretraining model, which transitions from a language-vision to a more suitable language-motion latent space. It addresses key limitations by generating motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences. With LaMP, we advance three key tasks: text-to-motion generation, motion-text retrieval, and motion captioning through aligned language-motion representation learning. For generation, we utilize LaMP to provide the text condition instead of CLIP, and an autoregressive masked prediction is designed to achieve mask modeling without rank collapse in transformers. For retrieval, motion features from LaMP's motion transformer interact with query tokens to retrieve text features from the text transformer, and vice versa. For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model. In addition, we introduce the LaMP-BertScore metric to assess the alignment of generated motions with textual descriptions. Extensive experimental results on multiple datasets demonstrate substantial improvements over previous methods across all three tasks. The code of our method will be made public.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The core problem that this paper attempts to solve is that existing methods are unable to effectively align language and motion representations when generating, retrieving, and describing human motions. Specifically, most existing methods rely on CLIP text embeddings to generate motions. However, CLIP is pre-trained on static image-text pairs, so it has deficiencies in capturing dynamic motion features. This results in a low semantic correlation between the generated motion sequences and the text descriptions, affecting the effectiveness of the task.

To solve this problem, the authors propose the LaMP (Language-Motion Pretraining) model, which aims to transform the representations of language and motion from the static visual space to a more appropriate language-motion latent space. In this way, LaMP can generate more informative text embeddings, thereby significantly improving the relevance and semantic consistency of the generated motion sequences. In addition, LaMP also improves three key tasks: text-to-motion generation, motion-to-text retrieval, and motion caption generation.

### Main contributions:

1. **Proposing the LaMP model**: Extract text embeddings as conditional signals through the language-motion pre-training model to guide motion generation, ensuring that the generated motions are more in line with semantic information and reducing the gap between modalities.
2. **Designing the LaMP-T2M model**: Adopting an autoregressive mask prediction mechanism to alleviate the problem of decreased expressiveness caused by low-rank matrices during the training process and enhancing the information interaction within the masked area.
3. **Developing the LaMP-M2T model**: Using motion features rich in language information obtained from LaMP to fine-tune a large language model (LLM) to achieve motion caption generation.
4. **Introducing the LaMP-BertScore evaluation metric**: Used to evaluate the degree of alignment between the generated motions and semantic information.

Through these improvements, the experimental results of LaMP on multiple datasets show that it outperforms existing methods in text-to-motion generation, motion-to-text retrieval, and motion caption generation tasks. For example, on the HumanML3D dataset, the FID metric is reduced by 28.9%, and on the KIT-ML dataset, it is reduced by 28.0%.
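
To make the idea of retrieval in a shared language-motion latent space more concrete, the following is a minimal sketch in plain Python. It assumes that motion and text have already been encoded into vectors of the same dimension, and it ranks candidates by cosine similarity. This is a generic nearest-neighbor illustration, not LaMP's query-token mechanism; the function names, the toy embeddings, and the captions are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity_matrix(motion_embs, text_embs):
    """Return sim[i][j] = cosine similarity between motion i and text j."""
    motions = [l2_normalize(m) for m in motion_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(m, t)) for t in texts] for m in motions]


def top_k(scores, k=1):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Toy embeddings (hypothetical values, NOT produced by any real encoder).
motion_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embs = [[0.1, 0.0, 1.0], [1.0, 0.2, 0.1]]
captions = ["a person jumps", "a person walks forward"]

sim = cosine_similarity_matrix(motion_embs, text_embs)

# Motion-to-text retrieval: best caption index for each motion.
motion_to_text = [top_k(row)[0] for row in sim]

# Text-to-motion retrieval: transpose the matrix and rank motions per caption.
sim_t = [list(col) for col in zip(*sim)]
text_to_motion = [top_k(row)[0] for row in sim_t]

for i, j in enumerate(motion_to_text):
    print(f"motion {i} -> {captions[j]!r}")
```

The same similarity matrix serves both retrieval directions, which is why a well-aligned latent space benefits motion-to-text and text-to-motion retrieval at once. How well the ranking works depends entirely on the encoders that produce the embeddings.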