M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Seunggeun Chi, Hyung-gun Chi, Hengbo Ma, Nakul Agarwal, Faizan Siddiqui, Karthik Ramani, Kwonjoon Lee
2024-07-20
Abstract: We introduce the Multi-Motion Discrete Diffusion Models (M2D2M), a novel approach for human motion generation from textual descriptions of multiple actions, utilizing the strengths of discrete diffusion models. This approach adeptly addresses the challenge of generating multi-motion sequences, ensuring seamless transitions of motions and coherence across a series of actions. The strength of M2D2M lies in its dynamic transition probability within the discrete diffusion model, which adapts transition probabilities based on the proximity between motion tokens, encouraging mixing between different modes. Complemented by a two-phase sampling strategy that includes independent and joint denoising steps, M2D2M effectively generates long-term, smooth, and contextually coherent human motion sequences, utilizing a model trained for single-motion generation. Extensive experiments demonstrate that M2D2M surpasses current state-of-the-art benchmarks for motion generation from text descriptions, showcasing its efficacy in interpreting language semantics and generating dynamic, realistic motions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to generate a coherent, natural sequence of multiple human motions from text. Given a series of action descriptions, the goal is a motion sequence that transitions smoothly between actions and stays consistent with each description. The challenge has two parts:

1. **Multi-action generation**: Most existing methods focus on generating a single action. Generating a sequence of consecutive actions also requires smooth transitions between them. Directly splicing independently generated actions together often produces unnatural transitions, which hurts the coherence and realism of the overall motion.
2. **Text-to-action alignment**: The generated motion must not only look realistic but also match the semantics of the text. Each motion segment has to reflect its corresponding description accurately, and the sequence as a whole has to remain consistent and coherent.

To address these problems, the paper proposes M2D2M (Multi-Motion Discrete Diffusion Models), which generates long-term, smooth, and contextually coherent multi-action motion sequences by combining dynamic transition probabilities with a two-phase sampling strategy:

- **Dynamic transition probabilities**: Transition probabilities are adjusted dynamically according to the distance between motion tokens, so that the model explores diverse motions early in the diffusion process and gradually converges to accurate motions later in the process.
- **Two-phase sampling strategy**: Joint sampling first produces a coarse outline of the entire motion sequence. Independent sampling then refines each motion segment, so that every segment closely matches its text description while the transitions between actions stay smooth.

Together, these techniques allow M2D2M to outperform existing benchmark models on the multi-action generation task.
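
The two mechanisms described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the parameters `beta`, `temperature`, and `joint_steps`, and the `denoise_step` callable are all hypothetical, and the `exp(-distance / temperature)` weighting is just one plausible way to make transitions depend on token proximity. A large `temperature` spreads probability over distant tokens (more exploration), while a small one concentrates it on nearby tokens.

```python
import numpy as np


def distance_aware_transition(codebook, beta, temperature):
    """Build a row-stochastic K x K token transition matrix (illustrative).

    A token keeps its value with probability 1 - beta. Otherwise it moves to
    another token with probability proportional to exp(-distance / temperature).
    Assumes the codebook holds at least two token embeddings.
    """
    num_tokens = codebook.shape[0]
    # Pairwise Euclidean distances between token embeddings.
    dist = np.linalg.norm(codebook[:, None, :] - codebook[None, :, :], axis=-1)
    logits = -dist / temperature
    np.fill_diagonal(logits, -np.inf)            # self-transitions handled below
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    q = beta * weights
    q[np.diag_indices(num_tokens)] = 1.0 - beta
    return q


def two_phase_sample(segments, denoise_step, num_steps, joint_steps):
    """Denoise jointly for the first `joint_steps` steps, then independently.

    `segments` is a list of 1-D token arrays, one per action description.
    `denoise_step(tokens, t)` is a hypothetical stand-in for one reverse
    diffusion step of a trained model; text conditioning is omitted here.
    """
    lengths = [len(s) for s in segments]
    boundaries = np.cumsum(lengths)[:-1]
    for step in range(num_steps):
        t = num_steps - step  # timestep counts down from num_steps to 1
        if step < joint_steps:
            # Joint phase: all segments are denoised as one long sequence.
            joint = denoise_step(np.concatenate(segments), t)
            segments = np.split(joint, boundaries)
        else:
            # Independent phase: each segment is refined on its own.
            segments = [denoise_step(s, t) for s in segments]
    return np.concatenate(segments)
```

In this sketch, a step-dependent schedule for `temperature` and `beta` would supply the "dynamic" part, and `joint_steps` controls how much of the sampling budget goes to outlining the whole sequence before per-segment refinement. How the paper actually sets these quantities is not stated in the summary above.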