Abstract: Text-conditioned human motion generation, which allows for user interaction through natural language, has become increasingly popular. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous and can extend over long periods, carrying rich semantics. Creating long, complex motions that precisely respond to streams of text descriptions, particularly in an online and real-time setting, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation presents additional challenges, as it requires aligning the motion semantics specified by text descriptions with geometric information, such as goal locations and 3D scene geometry. To address these limitations, we propose DART, a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. Our model, DART, effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. By autoregressively generating motion primitives based on the preceding history and current text input, DART enables real-time, sequential motion generation driven by natural language descriptions. Additionally, the learned motion primitive space allows for precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process addressed through reinforcement learning. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks. Experiments show our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: https://zkf1997.github.io/DART/

What problem does this paper attempt to address?

The paper addresses two main problems:

1. **Real-time generation of long, complex motions**: Existing text-conditioned human motion generation methods typically produce only short, isolated motion segments from a single input sentence. Human motion, however, is continuous, can extend over long periods, and carries rich semantics. Creating long, complex motions that accurately respond to a stream of text descriptions, especially in an online, real-time setting, remains a major challenge.
2. **Incorporation of spatial constraints**: Incorporating spatial constraints (such as goal locations and 3D scene geometry) into text-conditioned motion generation brings additional challenges. The motion semantics specified by the text must be aligned with the geometric information, so that the motion both follows the text and fits the specific spatial environment.

To address these challenges, the authors propose **DART** (a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control), an autoregressive motion primitive model built on latent diffusion. DART tackles the problems above in three ways:

- **Autoregressive motion primitive representation**: DART decomposes long human motions into a series of overlapping short motion segments (motion primitives), which simplifies the data distribution and makes generative learning more efficient. Each motion primitive consists of history frames and future frames, which enables efficient online inference and generation.
- **Learning a compact, text-conditioned motion space**: DART uses a latent diffusion model to learn, from large-scale data, a compact motion primitive space conditioned on text and motion history. By autoregressively generating motion primitives, DART can synthesize motion sequences of arbitrary length in response to real-time text input while maintaining a high generation speed.
- **Precise spatial control**: The compact motion space learned by DART allows for precise spatial control through latent noise optimization or reinforcement learning. Specifically, motion sequences that satisfy both text and spatial constraints can be generated by optimizing the latent noise or by training a reinforcement learning policy.

In summary, DART aims to overcome the limitations of existing methods in generating long, complex motions and in incorporating spatial constraints, providing an efficient, general-purpose solution applicable to a variety of motion synthesis tasks.
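The autoregressive rollout over overlapping motion primitives described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_generator` is a hypothetical stand-in for the learned latent diffusion model and decoder, frames are reduced to 1-D positions, and the primitive sizes `H` and `F` as well as the per-text speeds are assumed values chosen only for illustration.

```python
import random

H = 2  # history frames per primitive (assumed value)
F = 8  # future frames per primitive (assumed value)

# Hypothetical mapping from a text prompt to a forward speed per frame.
SPEED = {"walk": 1.0, "run": 2.0, "stand": 0.0}


def toy_generator(history, text, noise):
    """Hypothetical stand-in for the text- and history-conditioned model.

    Returns F future frames (1-D positions) that continue from the last
    history frame; the noise only perturbs each step slightly.
    """
    position = history[-1]
    future = []
    for i in range(F):
        position = position + SPEED[text] + 0.01 * noise[i]
        future.append(position)
    return future


def rollout(seed_history, text_stream, rng):
    """Generate one primitive per text prompt and chain them together."""
    motion = list(seed_history)
    history = list(seed_history)
    for text in text_stream:
        noise = [rng.gauss(0.0, 1.0) for _ in range(F)]
        future = toy_generator(history, text, noise)
        motion.extend(future)
        # Overlap: the last H generated frames become the next history.
        history = future[-H:]
    return motion


rng = random.Random(0)
motion = rollout([0.0, 0.0], ["walk", "run", "stand"], rng)
print(len(motion))  # 26 = 2 seed frames + 3 primitives * 8 future frames
```

Because each step depends only on the most recent `H` frames and the current prompt, the text stream can be arbitrarily long and prompts can change between primitives, which is the property that makes online generation possible. In this toy setup the noise is sampled randomly; the spatial control described above would instead treat the noise as a variable to optimize or as the action of a learned policy.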

DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control
