Abstract: The focus of this paper is on 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion. In this paper, we address both challenges. We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text, introducing the new MotionFix dataset. Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input. We develop several baselines to evaluate our model, comparing it against models trained solely on text-motion pair datasets, and demonstrate the superior performance of our model trained on triplets. We also introduce new retrieval-based metrics for motion editing, establishing a benchmark on the evaluation set of MotionFix. Our results are promising, paving the way for further research in fine-grained motion generation. Code, models, and data are available at https://motionfix.is.tue.mpg.de/.

What problem does this paper attempt to address?

The problem this paper addresses is **text-driven 3D human motion editing**: given a 3D human motion and a text instruction describing the desired modification, generate a new, edited motion that conforms to the text.

### Problem Background and Challenges

1. **Lack of Training Data**: Few high-quality datasets exist to support editing 3D human motions from natural-language descriptions.
2. **Complexity of Model Design**: The model must faithfully edit the source motion according to the text instruction, which requires fine-grained adjustment of motions and semantic understanding of the edit.

### Main Contributions of the Paper

To address these challenges, the authors make the following contributions:

1. **Constructing the MotionFix Dataset**: A new, semi-automatically collected dataset for 3D human motion editing, consisting of triplets of source motion, target motion, and edit text. The dataset makes it possible to train and evaluate text-driven motion editing models.
2. **Introducing the TMED Model**: The Text-based Motion Editing Diffusion Model, a diffusion-based framework that takes a source motion and an edit text as input and generates the edited motion.
3. **Proposing New Evaluation Metrics**: The authors introduce new retrieval-based metrics for motion editing and establish a benchmark on the MotionFix evaluation set.

### Specific Implementation Methods

- **Dataset Construction**: The authors mine existing motion capture (MoCap) datasets for similar motion pairs and have human annotators describe the differences between the two motions in each pair. The TMR (Text-Motion Retrieval) model is used to keep pairs that are neither too similar nor too different.
- **Model Architecture**: TMED is a conditional diffusion model trained on source motions together with edit texts. It uses separate encoders for the diffusion time step, the text, and the motions, and a Transformer integrates all of these inputs.
- **Loss Function**: The model is trained with a mean squared error (MSE) loss that minimizes the difference between the predicted denoised target motion and the ground-truth target motion.

### Application Scenarios

The work has potential applications in animation production, virtual reality, and game development, where it could make 3D human motion editing more efficient and flexible.

Together, these contributions address the data and modeling challenges of text-driven 3D human motion editing and provide a foundation for future research.
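
The training objective described under Loss Function above can be illustrated with a minimal sketch. It is not the authors' implementation. The cosine noise schedule, the function names, the `toy_denoiser` stand-in for the Transformer, and the assumption that source and target motions have the same number of frames are all hypothetical choices made for illustration. A motion is represented here as a list of frames, each a list of float features.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level at step t for a cosine schedule (assumed choice)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def add_noise(target, t, num_steps, rng):
    """Forward diffusion: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, with eps ~ N(0, 1)."""
    a = cosine_alpha_bar(t, num_steps)
    return [
        [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in frame]
        for frame in target
    ]


def mse(pred, target):
    """Mean squared error over all frames and features."""
    diffs = [
        (p - q) ** 2
        for pred_frame, target_frame in zip(pred, target)
        for p, q in zip(pred_frame, target_frame)
    ]
    return sum(diffs) / len(diffs)


def toy_denoiser(noisy_target, t, source, text_embedding):
    """Hypothetical stand-in for the network: guesses that the edit changes nothing."""
    return [list(frame) for frame in source]


def training_loss(denoiser, source, target, text_embedding, num_steps, rng):
    """One training step's loss: noise the target, predict the clean target, take MSE."""
    t = rng.randint(1, num_steps)  # random diffusion step
    noisy_target = add_noise(target, t, num_steps, rng)
    # The denoiser is conditioned on the step, the source motion, and the edit text.
    predicted_target = denoiser(noisy_target, t, source, text_embedding)
    return mse(predicted_target, target)


if __name__ == "__main__":
    rng = random.Random(0)
    source = [[0.0, 0.0], [1.0, 1.0]]   # two frames, two features each
    target = [[0.0, 1.0], [1.0, 2.0]]   # edited motion: second feature raised by 1
    text_embedding = [0.1, 0.2, 0.3]    # placeholder for an encoded edit text
    print(training_loss(toy_denoiser, source, target, text_embedding, 100, rng))
```

With these toy values the stand-in denoiser simply returns the source motion, so the loss is 0.5 regardless of the sampled step. A trained network would instead use the noisy target, the step, the source motion, and the text to predict the clean target.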

MotionFix: Text-Driven 3D Human Motion Editing

Motion Generation from Fine-grained Textual Descriptions

Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation

Contact-aware Human Motion Generation from Textual Descriptions

TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

Motion Flow Matching for Human Motion Synthesis and Editing

MotionGPT: Human Motion Synthesis with Improved Diversity and Realism via GPT-3 Prompting

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Iterative Motion Editing with Natural Language

MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

Understanding Text-driven Motion Synthesis with Keyframe Collaboration via Diffusion Models

MMM: Generative Masked Motion Model

Generating Human Interaction Motions in Scenes with Text Control

Morphology Independent Motion Retrieval and Control

Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

Enhanced Fine-Grained Motion Diffusion for Text-Driven Human Motion Synthesis

MotionEditor: Editing Video Motion via Content-Aware Diffusion

Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

Enhancing Motion Variation in Text-to-Motion Models via Pose and Video Conditioned Editing