Enhancing Motion Variation in Text-to-Motion Models via Pose and Video Conditioned Editing

Clayton Leite, Yu Xiao
2024-10-11
Abstract: Text-to-motion models that generate sequences of human poses from textual descriptions are garnering significant attention. However, due to data scarcity, the range of motions these models can produce is still limited. For instance, current text-to-motion models cannot generate a motion of kicking a football with the instep of the foot, since the training data only includes martial arts kicks. We propose a novel method that uses short video clips or images as conditions to modify existing basic motions. In this approach, the model's understanding of a kick serves as the prior, while the video or image of a football kick acts as the posterior, enabling the generation of the desired motion. By incorporating these additional modalities as conditions, our method can create motions not present in the training set, overcoming the limitations of text-motion datasets. A user study with 26 participants demonstrated that our approach produces unseen motions with realism comparable to commonly represented motions in text-motion datasets (e.g., HumanML3D), such as walking, running, squatting, and kicking.
Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses the limited range of human motions that text-to-motion models can generate. Because training data is scarce, existing text-to-motion models cannot produce certain specific actions. For example, they cannot generate the action of kicking a football with the instep, because the training data only contains martial arts kicks.

To address this, the authors propose a method that modifies existing basic motions by using short video clips or images as conditions. The method draws on the detailed pose information in the video or image, which lets the model generate motions that do not appear in the training dataset. This widens the range of motions the model can produce while keeping their realism comparable to that of motions that are well represented in the training data.

### Method overview

1. **Problem background**:
   - Text-to-motion models are trained on pairs of text descriptions and corresponding motion sequences.
   - Collecting high-quality motion data is challenging, which results in data scarcity.
   - Existing models have difficulty capturing and reproducing the complex and diverse human motions described in the input text.
2. **Proposed solution**:
   - Use short video clips or images as conditions to modify existing basic motions.
   - Treat the model's understanding of a motion (for example, a kick) as the prior and the video or image (for example, of a football kick) as the posterior, and generate the desired motion from the two.
   - This allows the method to create motions that do not exist in the training dataset, overcoming the limitations of text-motion datasets.
3. **Experimental verification**:
   - The realism of the generated motions is evaluated through a user study with 26 participants.
   - The results show that the unseen motions generated by the method are comparable in realism to motions commonly represented in text-motion datasets such as HumanML3D (walking, running, squatting, and kicking).
### Formula explanation

The formulas in the paper mainly describe how the underlying diffusion model works. The key ones are:

1. **Noise addition process**:
   \[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right) \]
   where $\mathbf{x}_t$ is the noisy data at step $t$, $\mathbf{x}_{t-1}$ is the data from the previous step, and $\beta_t$ is the variance schedule parameter.
2. **Reverse diffusion process**:
   \[ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\left(\mathbf{x}_{t-1};\ \mu_\theta(\mathbf{x}_t, t),\ \Sigma_\theta(\mathbf{x}_t, t)\right) \]
   where $\mu_\theta(\mathbf{x}_t, t)$ and $\Sigma_\theta(\mathbf{x}_t, t)$ are the predicted mean and covariance, respectively.
3. **Loss function**:
   \[ L = \left\lVert \tilde{\mu}(\mathbf{x}_t, \mathbf{x}_0) - \mu_\theta(\mathbf{x}_t, t) \right\rVert^2 \]
   where $\tilde{\mu}(\mathbf{x}_t, \mathbf{x}_0)$ is the true mean, that is, the mean of the forward-process posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, and $\mu_\theta(\mathbf{x}_t, t)$ is the predicted mean.

With these formulas, the model removes noise step by step during the reverse process and finally generates a new motion that satisfies the conditions.

### Summary

By introducing videos or images as conditions, the paper extends the range of motions that text-to-motion models can generate to motions absent from the training set. According to the user study, these motions are comparable in realism to motions that are common in text-motion datasets. The approach expands the capabilities of existing models and suggests a direction for future research.
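To make the formula explanation more concrete, the following is a minimal NumPy sketch of the noise addition step, the true mean $\tilde{\mu}(\mathbf{x}_t, \mathbf{x}_0)$, and the squared-error loss. It is not the authors' implementation. The linear $\beta_t$ schedule, the closed form used for $\tilde{\mu}$ (the standard DDPM expression), the 0-based step index, the array shape, and all function names are assumptions made for illustration.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical variance schedule beta_1..beta_T (the linear form is an assumption)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise


def posterior_mean(x_t, x_0, t, betas):
    """True mean mu_tilde(x_t, x_0) of q(x_{t-1} | x_t, x_0).

    Uses the standard DDPM closed form; t is a 0-based index into betas,
    which is a convention chosen for this sketch.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    alpha_bar_prev = alpha_bar[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(alpha_bar_prev) * betas[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar[t])
    return coef_x0 * x_0 + coef_xt * x_t


def mean_matching_loss(mu_tilde, mu_pred):
    """Squared L2 distance between the true mean and the predicted mean."""
    return float(np.sum((mu_tilde - mu_pred) ** 2))


# Usage: noise a toy "motion" array (frames x pose features, shape is hypothetical).
rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
x_0 = rng.standard_normal((60, 66))

x_t = x_0
for t in range(10):  # apply the first 10 noising steps
    x_t = forward_step(x_t, betas[t], rng)

# Stand-in for a network output mu_theta(x_t, t); a real model would predict this.
mu_pred = np.zeros_like(x_0)
loss = mean_matching_loss(posterior_mean(x_t, x_0, 9, betas), mu_pred)
```

In a trained model, `mu_pred` would come from the denoising network, and the text, image, or video conditions described in the method overview would be additional inputs to that network; the sketch leaves the conditioning out entirely.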