Abstract: Language plays a vital role in the realm of human motion. Existing methods have largely depended on CLIP text embeddings for motion generation, yet they fall short in effectively aligning language and motion due to CLIP's pretraining on static image-text pairs. This work introduces LaMP, a novel Language-Motion Pretraining model, which transitions from a language-vision to a more suitable language-motion latent space. It addresses key limitations by generating motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences. With LaMP, we advance three key tasks: text-to-motion generation, motion-text retrieval, and motion captioning through aligned language-motion representation learning. For generation, we utilize LaMP to provide the text condition instead of CLIP, and an autoregressive masked prediction is designed to achieve mask modeling without rank collapse in transformers. For retrieval, motion features from LaMP's motion transformer interact with query tokens to retrieve text features from the text transformer, and vice versa. For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model. In addition, we introduce the LaMP-BertScore metric to assess the alignment of generated motions with textual descriptions. Extensive experimental results on multiple datasets demonstrate substantial improvements over previous methods across all three tasks. The code of our method will be made public.

What problem does this paper attempt to address?

The core problem this paper addresses is that existing methods do not effectively align language and motion representations when generating, retrieving, and captioning human motion. Specifically, most existing methods rely on CLIP text embeddings to condition motion generation. However, CLIP is pre-trained on static image-text pairs, so it captures dynamic motion features poorly. The result is a weak semantic correlation between the generated motion sequences and their text descriptions, which degrades task performance.

To solve this problem, the authors propose LaMP (Language-Motion Pretraining), a model that moves language and motion representations from the language-vision latent space to a more suitable language-motion latent space. In this way, LaMP produces motion-informative text embeddings, significantly improving the relevance and semantic consistency of the generated motion sequences. LaMP is applied to three key tasks: text-to-motion generation, motion-text retrieval, and motion captioning.

### Main contributions:

1. **Proposing the LaMP model**: Text embeddings extracted by the language-motion pre-training model serve as the conditioning signal for motion generation, so that generated motions follow the text semantics more closely and the gap between modalities is reduced.
2. **Designing the LaMP-T2M model**: An autoregressive masked prediction mechanism mitigates the loss of expressiveness caused by low-rank matrices (rank collapse) during training and enhances information interaction within the masked region.
3. **Developing the LaMP-M2T model**: Language-informative motion features obtained from LaMP are used to fine-tune a large language model (LLM) for motion captioning.
4. **Introducing the LaMP-BertScore evaluation metric**: This metric evaluates how well the generated motions align with their textual descriptions.

Experimental results on multiple datasets show that LaMP outperforms existing methods in text-to-motion generation, motion-text retrieval, and motion captioning. For example, the FID metric is reduced by 28.9% on the HumanML3D dataset and by 28.0% on the KIT-ML dataset.
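
To make the idea of a shared language-motion latent space more concrete, here is a minimal retrieval sketch. It is not LaMP's implementation: the paper has motion features interact with query tokens, whereas this example assumes each motion and each caption has already been encoded into a single vector and simply ranks captions by cosine similarity. The embeddings, the captions in the comments, and the function names `cosine` and `rank_candidates` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy, hand-made embeddings (hypothetical; not produced by LaMP).
text_embeddings = [
    [0.9, 0.1, 0.0],    # "a person walks forward"
    [0.0, 1.0, 0.1],    # "a person jumps"
    [0.1, 0.0, 0.95],   # "a person waves"
]
motion_embedding = [0.05, 0.9, 0.2]  # stands in for an encoded jumping motion

# Motion-to-text retrieval: rank the captions for the given motion.
ranking = rank_candidates(motion_embedding, text_embeddings)
print(ranking)  # index 1 ("a person jumps") is ranked first
```

Text-to-motion retrieval works the same way with the roles swapped: a text embedding becomes the query and the motion embeddings become the candidates. The better the two encoders are aligned, the more often the matching item lands at the top of the ranking.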

LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning

LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

MLUG: Bootstrapping Language-Motion Pre-Training for Unified Motion-Language Understanding and Generation

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

MotionGPT: Human Motion as a Foreign Language

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation

LangSAMP: Language-Script Aware Multilingual Pretraining

Human Motion Instruction Tuning

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

MotionCLIP: Exposing Human Motion Generation to CLIP Space

LocoMotion: Learning Motion-Focused Video-Language Representations

What If We Recaption Billions of Web Images with LLaMA-3?

Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation

Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Action Generation

Language-Assisted Human Part Motion Learning for Skeleton-Based Temporal Action Segmentation

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

Sign Language Production with Latent Motion Transformer

DreamLIP: Language-Image Pre-training with Long Captions

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models