MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling

Weihao Yuan, Weichao Shen, Yisheng He, Yuan Dong, Xiaodong Gu, Zilong Dong, Liefeng Bo, Qixing Huang
2024-09-26
Abstract: Motion generation from discrete quantization offers many advantages over continuous regression, but at the cost of inevitable approximation errors. Previous methods usually quantize the entire body pose into one code, which not only faces the difficulty in encoding all joints within one vector but also loses the spatial relationship between different joints. Differently, in this work we quantize each individual joint into one vector, which i) simplifies the quantization process as the complexity associated with a single joint is markedly lower than that of the entire pose; ii) maintains a spatial-temporal structure that preserves both the spatial relationships among joints and the temporal movement patterns; iii) yields a 2D token map, which enables the application of various 2D operations widely used in 2D images. Grounded in the 2D motion quantization, we build a spatial-temporal modeling framework, where 2D joint VQVAE, temporal-spatial 2D masking technique, and spatial-temporal 2D attention are proposed to take advantage of spatial-temporal signals among the 2D tokens. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, with a $26.6\%$ decrease of FID on HumanML3D and a $29.9\%$ decrease on KIT-ML.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper addresses the generation of human motion sequences from text prompts. Specifically, it studies how to generate motion through discrete quantization rather than conventional continuous regression. Quantization simplifies the generation task, but it also introduces unavoidable approximation errors.

#### Main problems and challenges:

1. **Approximation error**: Previous methods usually quantize the entire body pose into a single vector, which makes it hard to encode all joint information and loses the spatial relationships between joints.
2. **Complexity and precision**: Quantizing the entire pose directly complicates encoding and makes the spatial relationships between joints difficult to maintain, which degrades the quality of the generated motion.
3. **Preservation of spatial-temporal structure**: To capture and exploit the spatial-temporal information in motion, the quantization process itself must preserve that information.

#### Solutions:

The authors propose **MoGenTS (Motion Generation based on Spatial-Temporal Joint Modeling)**, which addresses these problems as follows:

1. **Single-joint quantization**: Each joint is quantized into its own vector instead of quantizing the entire pose into a single vector. Because a single joint is far less complex than a full pose, this simplifies quantization, and it preserves the spatial relationships between joints.
2. **Spatial-temporal 2D structure**: The quantized joints are arranged into a 2D token map, analogous to a 2D image, with one axis for joints and one for time. This allows the use of 2D operations (such as 2D convolution, 2D position encoding, and 2D attention), which strengthens the model's ability to process spatial-temporal signals.
3. **Spatial-temporal modeling framework**: The framework combines a 2D joint VQ-VAE, a temporal-spatial 2D masking technique, and spatial-temporal 2D attention to exploit the spatial-temporal signals among the 2D tokens.
4. **Masking strategy**: During training, the temporal-spatial 2D masking strategy randomly masks some tokens so that the model learns to predict the masked parts.
5. **Experimental verification**: Experiments show that the method significantly outperforms previous methods across datasets, reducing FID by 26.6% on HumanML3D and by 29.9% on KIT-ML.

### Summary

By quantizing each joint separately and introducing a spatial-temporal modeling framework, the paper mitigates the approximation error of quantization and preserves spatial-temporal information, improving the quality of human motion generated from text prompts.
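
To make the per-joint quantization and 2D masking ideas more concrete, here is a minimal Python sketch. It is not the authors' implementation: a plain nearest-neighbour codebook lookup stands in for the learned 2D joint VQ-VAE, and the two-stage masking order (whole frames first, then individual joints in the remaining frames) is an assumption made for illustration. The names `quantize_joints`, `mask_2d`, `MASK`, and the ratio parameters are hypothetical.

```python
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def quantize_joints(features, codebook):
    """Map every joint feature to the index of its nearest codebook vector.

    features: T frames, each a list of J joint vectors (lists of floats).
    codebook: K code vectors of the same dimension.
    Returns a T x J grid of integer token ids (the "2D token map").
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest(vec):
        return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))

    return [[nearest(joint) for joint in frame] for frame in features]


def mask_2d(tokens, frame_ratio, joint_ratio, rng):
    """Mask a T x J token grid along time first, then along joints.

    Assumed scheme: hide all joints in a random subset of frames, then hide
    a random subset of joints in each frame that is still visible.
    Returns a new grid; the input is left unchanged.
    """
    num_frames, num_joints = len(tokens), len(tokens[0])
    masked = [row[:] for row in tokens]

    # Temporal stage: fully mask a random subset of frames.
    hidden_frames = set(rng.sample(range(num_frames), int(frame_ratio * num_frames)))
    for t in hidden_frames:
        masked[t] = [MASK] * num_joints

    # Spatial stage: mask a random subset of joints in each remaining frame.
    joints_per_frame = int(joint_ratio * num_joints)
    for t in range(num_frames):
        if t in hidden_frames:
            continue
        for j in rng.sample(range(num_joints), joints_per_frame):
            masked[t][j] = MASK
    return masked


# Toy input: 2 frames x 3 joints, each joint a 2-D feature vector.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
features = [
    [[0.1, 0.0], [0.9, 1.1], [-0.8, 1.0]],
    [[0.2, -0.1], [1.0, 0.8], [-1.1, 0.9]],
]
tokens = quantize_joints(features, codebook)
print(tokens)
print(mask_2d(tokens, frame_ratio=0.5, joint_ratio=0.34, rng=random.Random(0)))
```

Keeping the tokens as a frames-by-joints grid is what lets masking (and, in the paper, attention and convolution) operate separately along the temporal and spatial axes, which a single code per pose would not allow.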