Abstract: Motion generation from discrete quantization offers many advantages over continuous regression, but at the cost of inevitable approximation errors. Previous methods usually quantize the entire body pose into one code, which not only makes it difficult to encode all joints within one vector but also loses the spatial relationships between different joints. In contrast, in this work we quantize each individual joint into one vector, which i) simplifies the quantization process, as the complexity associated with a single joint is markedly lower than that of the entire pose; ii) maintains a spatial-temporal structure that preserves both the spatial relationships among joints and the temporal movement patterns; iii) yields a 2D token map, which enables the application of various 2D operations widely used for 2D images. Grounded in the 2D motion quantization, we build a spatial-temporal modeling framework, where a 2D joint VQVAE, a temporal-spatial 2D masking technique, and spatial-temporal 2D attention are proposed to take advantage of spatial-temporal signals among the 2D tokens. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, with a $26.6\%$ decrease of FID on HumanML3D and a $29.9\%$ decrease on KIT-ML.
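To make the per-joint quantization idea concrete, here is a minimal, dependency-free sketch of how quantizing each joint separately yields a 2D (frames × joints) token map. It is not the authors' implementation: the nearest-neighbour lookup is the generic vector-quantization step, the single shared codebook is an assumption, and the sizes `T`, `J`, `D`, `K` and the random latent features are made up for illustration.

```python
import random

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def quantize_per_joint(features, codebook):
    """Map features[t][j] (one latent vector per frame and joint) to a T x J token map."""
    return [[nearest_code(joint_vec, codebook) for joint_vec in frame]
            for frame in features]

# Toy sizes, chosen only for illustration (assumptions, not values from the paper).
T, J, D, K = 4, 22, 8, 16  # frames, joints, latent dimension, codebook size
rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
features = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(J)]
            for _ in range(T)]

token_map = quantize_per_joint(features, codebook)
print(len(token_map), len(token_map[0]))  # 4 22
```

Whole-pose quantization would instead produce one token per frame, so the joint axis, and with it the spatial layout, would be folded into a single code.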

### What problem does this paper attempt to solve?

This paper aims to solve the problem of generating human motion sequences based on text prompts. Specifically, researchers are concerned with how to generate motions from the perspective of discrete quantization rather than the traditional continuous regression. This transformation can simplify the task, but it also introduces inevitable approximation errors.

#### Main problems and challenges:

1. **Approximation error**:
   - Previous methods usually quantize the entire body pose into a single vector, which not only increases the difficulty of encoding all joint information but also leads to the loss of spatial relationships between joints.
2. **Complexity and precision**:
   - Direct quantization of the entire pose complicates the encoding process and makes it difficult to maintain the spatial relationships between joints, thus affecting the quality of the generated motions.
3. **Preservation of spatio-temporal structure**:
   - In order to better capture and utilize the spatio-temporal information in motions, a method is needed that can preserve this information during the quantization process.

#### Solutions:

The authors propose a new method, called **MoGenTS (Motion Generation based on Spatial-Temporal Joint Modeling)**, which solves the above problems in the following ways:

1. **Single-joint quantization**:
   - Each joint is individually quantized into a vector instead of quantizing the entire pose into a single vector. This simplifies the quantization process and preserves the spatial relationships between joints.
2. **Spatio-temporal two-dimensional structure**:
   - The quantized joint information is organized into a two-dimensional structure, similar to the two-dimensional representation of an image. This allows the application of various 2D operations (such as 2D convolution, 2D position encoding, and 2D attention mechanisms), thereby enhancing the model's ability to process spatio-temporal signals.
3. **Spatio-temporal modeling framework**:
   - A spatio-temporal modeling framework based on 2D joint VQ-VAE, spatio-temporal 2D masking techniques, and spatio-temporal 2D attention mechanisms is proposed to fully utilize the spatio-temporal information in the quantized 2D tokens.
4. **Masking strategy**:
   - A spatio-temporal two-dimensional masking strategy is designed to randomly mask some tokens during the training process, so that the model can better learn and predict the masked parts.
5. **Experimental verification**:
   - The experimental results show that this method significantly outperforms existing methods on multiple datasets, especially on the HumanML3D and KIT-ML datasets, with FID reduced by 26.6% and 29.9% respectively.

### Summary

This paper effectively solves the problems of approximation error and spatio-temporal information preservation in motion generation by improving the quantization method and introducing a new spatio-temporal modeling framework, thereby improving the accuracy and quality of generating human motions based on text prompts.
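The spatio-temporal 2D masking mentioned under the solutions can be sketched as follows. This is only one plausible reading of "temporal-spatial" masking, not the paper's exact procedure: some frames are masked entirely (temporal step), then a subset of joints is masked in each remaining frame (spatial step). The mask id `MASK`, the function name `mask_2d`, and both ratios are hypothetical.

```python
import random

MASK = -1  # hypothetical id reserved for the mask token

def mask_2d(token_map, frame_ratio, joint_ratio, rng):
    """Mask a T x J token map in two steps and return (masked_map, is_masked).

    Temporal step: a fraction `frame_ratio` of frames is masked completely.
    Spatial step: in every other frame, a fraction `joint_ratio` of joints is masked.
    """
    T, J = len(token_map), len(token_map[0])
    masked = [row[:] for row in token_map]          # copy; the input is left untouched
    is_masked = [[False] * J for _ in range(T)]
    masked_frames = set(rng.sample(range(T), round(frame_ratio * T)))
    for t in range(T):
        if t in masked_frames:
            joints = range(J)                        # whole frame is hidden
        else:
            joints = rng.sample(range(J), round(joint_ratio * J))
        for j in joints:
            masked[t][j] = MASK
            is_masked[t][j] = True
    return masked, is_masked

# Toy 8-frame, 6-joint token map with arbitrary token ids.
tokens = [[t * 6 + j for j in range(6)] for t in range(8)]
masked_tokens, mask_flags = mask_2d(tokens, 0.5, 0.5, random.Random(0))
num_masked = sum(flag for row in mask_flags for flag in row)
print(num_masked)  # 36: four full frames (24) plus three joints in each of four frames (12)
```

During training, a model would be asked to predict the original ids at the positions where `mask_flags` is true, using the text prompt and the visible tokens as context.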

MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling

Towards Efficient and Diverse Generative Model for Unconditional Human Motion Synthesis

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

ViMo: Generating Motions from Casual Videos

T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models

TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation

Motion Mamba: Efficient and Long Sequence Motion Generation

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

PoseGPT: Quantization-based 3D Human Motion Generation and Forecasting

MoMask: Generative Masked Modeling of 3D Human Motions

MoManifold: Learning to Measure 3D Human Motion via Decoupled Joint Acceleration Manifolds

T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data

TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

Dynamic Motion Transition: A Hybrid Data-driven and Model-driven Method for Human Pose Transitions

Cross-Modal Quantization for Co-Speech Gesture Generation

MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators

Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Kinematics Modeling Network for Video-based Human Pose Estimation

MMM: Generative Masked Motion Model