Abstract: Text-to-motion models that generate sequences of human poses from textual descriptions are garnering significant attention. However, due to data scarcity, the range of motions these models can produce is still limited. For instance, current text-to-motion models cannot generate a motion of kicking a football with the instep of the foot, since the training data only includes martial arts kicks. We propose a novel method that uses short video clips or images as conditions to modify existing basic motions. In this approach, the model's understanding of a kick serves as the prior, while the video or image of a football kick acts as the posterior, enabling the generation of the desired motion. By incorporating these additional modalities as conditions, our method can create motions not present in the training set, overcoming the limitations of text-motion datasets. A user study with 26 participants demonstrated that our approach produces unseen motions with realism comparable to commonly represented motions in text-motion datasets (e.g., HumanML3D), such as walking, running, squatting, and kicking.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper aims to solve the problem of insufficient diversity in the human action sequences generated by text-to-motion models. Because training data is scarce, existing text-to-motion models cannot generate certain specific actions. For example, they are unable to generate the action of kicking a football with the instep, because the training data only contains martial arts kicks.

To solve this problem, the authors propose a method that modifies existing basic actions by introducing short video clips or images as conditions. The method uses the detailed information in the videos or images, enabling the model to generate actions that do not appear in the training dataset. This increases the diversity of actions while, according to the user study, keeping realism comparable to that of common actions.

### Method overview

1. **Problem background**:
   - Text-to-motion models are trained on pairs of text descriptions and corresponding action sequences.
   - Collecting high-quality action data is challenging, which results in data scarcity.
   - Existing models have difficulty capturing and reproducing the complex and diverse human actions described in the input text.

2. **Proposed solution**:
   - Use short video clips or images as conditions to modify existing basic actions.
   - Treat the model's understanding of an action as prior knowledge (the prior) and the video or image as posterior knowledge (the posterior) to generate the required motion.
   - This makes it possible to create actions that do not exist in the training dataset, overcoming the limitations of text-motion datasets.

3. **Experimental verification**:
   - The realism of the generated actions is evaluated through a user study with 26 participants.
   - The results show that the new actions generated by this method are comparable in realism to common actions (such as walking, running, squatting, and kicking).

### Formula explanation

The formulas in the paper mainly describe how the diffusion model works. The key ones are:

1. **Noise addition process**:
   \[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\ \beta_t\mathbf{I}\right) \]
   where $\mathbf{x}_t$ is the noisy data at step $t$, $\mathbf{x}_{t-1}$ is the data from the previous step, and $\beta_t$ is the variance schedule parameter.

2. **Reverse diffusion process**:
   \[ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\left(\mathbf{x}_{t-1};\ \mu_\theta(\mathbf{x}_t, t),\ \Sigma_\theta(\mathbf{x}_t, t)\right) \]
   where $\mu_\theta(\mathbf{x}_t, t)$ and $\Sigma_\theta(\mathbf{x}_t, t)$ are the predicted mean and covariance respectively.

3. **Loss function**:
   \[ L = \left\lVert \tilde{\mu}(\mathbf{x}_t, \mathbf{x}_0) - \mu_\theta(\mathbf{x}_t, t) \right\rVert^2 \]
   where $\tilde{\mu}(\mathbf{x}_t, \mathbf{x}_0)$ is the true mean given the clean data $\mathbf{x}_0$, and $\mu_\theta(\mathbf{x}_t, t)$ is the predicted mean.

With the reverse process trained in this way, the model removes noise step by step during iteration and finally generates new actions that satisfy the conditions.

### Summary

By introducing videos or images as conditions, this paper extends the range of actions that text-to-motion models can generate, with realism reported as comparable to that of common actions. The method expands the capabilities of the model and provides new ideas for future research.
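
The three diffusion formulas in the formula explanation can be made concrete with a small sketch. It is not the paper's implementation: the linear schedule and its endpoints, the function names, and the toy three-value "pose" vector are all assumptions made for illustration. The closed form used for $\tilde{\mu}(\mathbf{x}_t, \mathbf{x}_0)$ is the standard one for Gaussian diffusion models, which the summary does not spell out, so treat it as an assumption about what the paper uses.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule beta_1..beta_T; the endpoints are assumed values."""
    if num_steps < 2:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_step(x_prev, beta_t, rng):
    """Noise addition: sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


def posterior_mean(x_t, x_0, betas, t):
    """True mean mu_tilde(x_t, x_0) of q(x_{t-1} | x_t, x_0); t is a 1-based step.

    Uses the standard closed form for Gaussian diffusion (assumed here):
    mu_tilde = sqrt(abar_{t-1}) * beta_t / (1 - abar_t) * x_0
             + sqrt(alpha_t) * (1 - abar_{t-1}) / (1 - abar_t) * x_t
    with alpha_t = 1 - beta_t and abar_t = alpha_1 * ... * alpha_t.
    """
    alphas = [1.0 - b for b in betas]
    alpha_bar_t = math.prod(alphas[:t])
    alpha_bar_prev = math.prod(alphas[:t - 1])  # empty product is 1 when t == 1
    beta_t = betas[t - 1]
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alphas[t - 1]) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return [coef_x0 * a + coef_xt * b for a, b in zip(x_0, x_t)]


def mean_matching_loss(mu_true, mu_pred):
    """Loss L = || mu_tilde - mu_theta ||^2 (squared Euclidean distance)."""
    return sum((a - b) ** 2 for a, b in zip(mu_true, mu_pred))


if __name__ == "__main__":
    rng = random.Random(0)
    betas = linear_beta_schedule(1000)
    x_0 = [0.3, -1.2, 0.8]  # toy stand-in for one pose vector

    # Run the noise addition process for steps t = 1..10.
    x_t = x_0
    for t in range(1, 11):
        x_t = forward_step(x_t, betas[t - 1], rng)

    mu_true = posterior_mean(x_t, x_0, betas, t=10)
    mu_pred = [0.0, 0.0, 0.0]  # placeholder for a network output mu_theta(x_t, t)
    print("loss:", mean_matching_loss(mu_true, mu_pred))
```

In a real model, `mu_pred` would come from a trained network that also receives the text and the video or image condition; the placeholder here only shows where that prediction enters the loss.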

Enhancing Motion Variation in Text-to-Motion Models via Pose and Video Conditioned Editing

Motion Prompting: Controlling Video Generation with Motion Trajectories

NewMove: Customizing text-to-video models with novel motions

Motion Control for Enhanced Complex Action Video Generation

TEMOS: Generating diverse human motions from textual descriptions

Fg-T2M: Fine-Grained Text-Driven Human Motion Generation Via Diffusion Model

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Motion Generation from Fine-grained Textual Descriptions

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data

Towards motion from video diffusion models

Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis

Make-an-animation: Large-scale text-conditional 3D human motion generation

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Understanding Text-driven Motion Synthesis with Keyframe Collaboration via Diffusion Models

Fleximo: Towards Flexible Text-to-Human Motion Video Generation

Animate Your Motion: Turning Still Images into Dynamic Videos

MotionBooth: Motion-Aware Customized Text-to-Video Generation

CoMo: Controllable Motion Generation through Language Guided Pose Code Editing