MotionFix: Text-Driven 3D Human Motion Editing

Nikos Athanasiou, Alpár Cseke, Markos Diomataris, Michael J. Black, Gül Varol
2024-09-20
Abstract: The focus of this paper is on 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion. In this paper, we address both challenges. We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text, introducing the new MotionFix dataset. Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input. We develop several baselines to evaluate our model, comparing it against models trained solely on text-motion pair datasets, and demonstrate the superior performance of our model trained on triplets. We also introduce new retrieval-based metrics for motion editing, establishing a benchmark on the evaluation set of MotionFix. Our results are promising, paving the way for further research in fine-grained motion generation. Code, models, and data are available at https://motionfix.is.tue.mpg.de/.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
This paper addresses **text-driven 3D human motion editing**: given a 3D human motion and a text instruction describing the desired modification, the goal is to generate a new, edited motion that conforms to the text.

### Problem Background and Challenges

1. **Lack of Training Data**: Few high-quality datasets exist to support editing 3D human motions from natural-language descriptions.
2. **Complexity of Model Design**: The model must edit the source motion faithfully according to the text instruction, which requires both fine-grained adjustment of the motion and an understanding of the instruction's semantics.

### Main Contributions of the Paper

To address these challenges, the authors make the following contributions:

1. **Constructing the MotionFix Dataset**: A new, semi-automatically collected dataset for 3D human motion editing, containing triplets of a source motion, a target motion, and an edit text. The dataset makes it possible to train and evaluate text-driven motion editing models.
2. **Introducing the TMED Model**: The Text-based Motion Editing Diffusion model, a diffusion-based framework that takes a source motion and an edit text as input and generates the edited motion.
3. **Proposing New Evaluation Metrics**: The authors introduce new retrieval-based metrics for motion editing and establish a benchmark on the evaluation set of MotionFix.

### Specific Implementation Methods

- **Dataset Construction**: Similar motion pairs are mined from existing motion capture (MoCap) datasets, and human annotators describe the difference between the two motions in each pair. The TMR (Text-to-Motion Retrieval) model is used to select pairs that are similar enough to be related yet different enough to warrant an edit.
- **Model Architecture**: TMED is a conditional diffusion model conditioned on both the source motion and the edit text. It uses separate encoders for the diffusion timestep, the text, and the motions, and a Transformer integrates all of these inputs.
- **Loss Function**: The model is trained with a mean squared error (MSE) loss between the predicted denoised target motion and the ground-truth target motion.

### Application Scenarios

The work is relevant to animation production, virtual reality, and game development, where it could make 3D human motion editing more efficient and flexible.

Together, these contributions address the data and modeling challenges listed above and provide a benchmark for future research on text-driven motion editing.
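
The MSE objective described under Loss Function can be illustrated with a minimal sketch. It is not the authors' implementation: the `denoiser` callable, the scalar noise level `alpha_bar`, the list-of-frames motion layout, and the standard forward-noising formula are all assumptions made for illustration. The summary above only states that the loss compares the predicted denoised target motion with the real target motion.

```python
import math
import random


def add_noise(target, alpha_bar, rng):
    """Noise a clean motion (assumed form): sqrt(a) * x0 + sqrt(1 - a) * eps."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [
        [signal * value + sigma * rng.gauss(0.0, 1.0) for value in frame]
        for frame in target
    ]


def mse(pred, target):
    """Mean squared error over all frames and feature dimensions."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def training_loss(denoiser, source, target, text_emb, alpha_bar, t, rng):
    """One training-step loss for a hypothetical conditional denoiser.

    The denoiser sees the noised target, the timestep, the edit-text
    embedding, and the source motion, and predicts the clean target motion.
    """
    noisy_target = add_noise(target, alpha_bar, rng)
    predicted_target = denoiser(noisy_target, t, text_emb, source)
    return mse(predicted_target, target)
```

Here each motion is a list of frames and each frame a list of floats. Any function with the signature `denoiser(noisy_target, t, text_emb, source)` can be plugged in, so a trivial baseline that returns the source motion unchanged yields the MSE between source and target.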