Human Motion Synthesis: A Diffusion Approach for Motion Stitching and In-Betweening

Michael Adewole, Oluwaseyi Giwa, Favour Nerrise, Martins Osifeko, Ajibola Oyedeji
2024-09-11
Abstract: Human motion generation is an important area of research in many fields. In this work, we tackle the problem of motion stitching and in-betweening. Current methods either require manual effort or are incapable of handling longer sequences. To address these challenges, we propose a diffusion model with a transformer-based denoiser to generate realistic human motion. Our method demonstrated strong performance in generating in-betweening sequences, transforming a variable number of input poses into smooth and realistic motion sequences consisting of 75 frames at 15 fps, resulting in a total duration of 5 seconds. We present the performance evaluation of our method using quantitative metrics such as Fréchet Inception Distance (FID), Diversity, and Multimodality, along with visual assessments of the generated outputs.
Computer Vision and Pattern Recognition, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
This paper addresses **motion stitching and in-betweening in human motion synthesis**. Existing methods either require human intervention or cannot handle long motion sequences. To address these challenges, the authors propose a diffusion model with a Transformer-based denoiser that generates realistic human motion.

### Main Problems

1. **Motion Stitching**: Generate a realistic motion sequence that passes smoothly through the given key frames. These key frames can appear at any position in the sequence.
2. **In-Betweening**: Fill in the missing frames of a given motion sequence to produce smooth and realistic motion.

### Limitations of Existing Methods

- **Manual Effort**: Some methods rely on manual adjustment or design, which is both time-consuming and error-prone.
- **Inability to Handle Long Sequences**: Existing methods perform poorly on longer motion sequences and cannot guarantee the quality of the generated motion.

### Proposed Solution

The authors propose a diffusion model that uses a Transformer-based denoiser. The steps are as follows:

1. **Encode Input Motion Frames**: Encode the input motion frames and their positions in the sequence, and feed them, together with the current diffusion step, into an encoder Transformer.
2. **Denoising Process**: Predict the clean motion with another encoder Transformer and gradually remove noise.
3. **Repeat Iteration**: Repeat the process for a predetermined number of steps to obtain a smooth and realistic motion sequence.

### Experimental Results

The authors demonstrate the effectiveness of the method through quantitative evaluation metrics (Fréchet Inception Distance (FID), Diversity, and Multimodality) and visual evaluation.
Experiments show that the method performs well when generating 5-second motion sequences, each consisting of 75 frames at a frame rate of 15 fps.

### Formula Representation

- **Noise Scheduling Formula**:
  \[ \beta_t = \beta_{\text{min}} + t \cdot \frac{\beta_{\text{max}} - \beta_{\text{min}}}{T} \]
  where $\beta_{\text{min}}$ and $\beta_{\text{max}}$ are the minimum and maximum noise levels respectively, and $T$ is the total number of time steps.
- **Forward Diffusion Process**:
  \[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \]
  where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ and $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise.
- **Reverse Denoising Process**:
  \[ x_{t-1} = \hat{x}_t^0 + \sqrt{\tilde{\beta}_t} \cdot \epsilon \]
  where $\hat{x}_t^0$ is the predicted clean motion, $\tilde{\beta}_t$ is the posterior variance, and $\epsilon \sim \mathcal{N}(0, I)$ is standard normal noise.

With this method, the authors address motion stitching and in-betweening and demonstrate its potential for generating high-quality, diverse motion sequences.
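
The three formulas in this summary can be sketched in NumPy to make the indexing concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: the values `T = 1000`, `beta_min = 1e-4`, and `beta_max = 0.02` are common DDPM defaults rather than settings taken from the paper, the per-frame feature size of 66 is hypothetical, and $\tilde{\beta}_t$ is assumed to follow the standard DDPM definition $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t$ with $\bar{\alpha}_0 = 1$, which the summary does not spell out. The Transformer denoiser is not modeled; the clean motion stands in for its prediction.

```python
import numpy as np


def linear_beta_schedule(T, beta_min=1e-4, beta_max=0.02):
    """beta_t = beta_min + t * (beta_max - beta_min) / T for t = 1..T.

    The default beta_min / beta_max are assumed values, not from the paper.
    """
    t = np.arange(1, T + 1)
    return beta_min + t * (beta_max - beta_min) / T


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s=1}^{t} (1 - beta_s); index t-1 holds step t."""
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bars, rng):
    """Forward process: x_t = sqrt(alpha_bar_t) x0 + sqrt(1 - alpha_bar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t - 1]  # t is 1-indexed, arrays are 0-indexed
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps


def posterior_variance(betas, alpha_bars):
    """Assumed standard DDPM posterior variance, with alpha_bar_0 = 1."""
    alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))
    return (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas


def reverse_step(x0_hat, t, post_var, rng):
    """Reverse step as written in the summary:
    x_{t-1} = x0_hat + sqrt(tilde_beta_t) * eps.

    Since tilde_beta_1 = 0, the step at t = 1 returns x0_hat unchanged.
    """
    eps = rng.standard_normal(x0_hat.shape)
    return x0_hat + np.sqrt(post_var[t - 1]) * eps


rng = np.random.default_rng(0)
T = 1000                                # assumed number of diffusion steps
betas = linear_beta_schedule(T)
alpha_bars = cumulative_alphas(betas)
post_var = posterior_variance(betas, alpha_bars)

x0 = rng.standard_normal((75, 66))      # 75 frames; 66 features is hypothetical
x_T, _ = q_sample(x0, T, alpha_bars, rng)

# Stand-in for the denoiser output: pretend it predicted x0 exactly.
x_prev = reverse_step(x0, T, post_var, rng)
```

Note that the reverse step here follows the summary's formula literally, adding noise directly to the predicted clean motion. Many DDPM implementations instead use a posterior mean that blends the predicted clean sample with $x_t$, so this sketch should be checked against the paper before being relied on.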