Abstract: Human motion generation is an important area of research in many fields. In this work, we tackle the problem of motion stitching and in-betweening. Current methods either require manual effort or are incapable of handling longer sequences. To address these challenges, we propose a diffusion model with a transformer-based denoiser to generate realistic human motion. Our method demonstrated strong performance in generating in-betweening sequences, transforming a variable number of input poses into smooth and realistic motion sequences consisting of 75 frames at 15 fps, resulting in a total duration of 5 seconds. We present the performance evaluation of our method using quantitative metrics such as Fréchet Inception Distance (FID), Diversity, and Multimodality, along with visual assessments of the generated outputs.

What problem does this paper attempt to address?

This paper addresses **motion stitching and in-betweening in human motion synthesis**. Existing methods either require manual intervention or cannot handle long motion sequences. To address these limitations, the authors propose a diffusion model with a Transformer-based denoiser that generates realistic human motion.

### Main Problems

1. **Motion Stitching**: Generate a realistic motion sequence that passes smoothly through the given keyframes, which may appear at any position in the sequence.
2. **In-Betweening**: Fill in the missing frames of a given motion sequence to produce smooth and realistic motion.

### Limitations of Existing Methods

- **Manual Effort**: Some methods rely on manual adjustment or design, which is both time-consuming and error-prone.
- **Inability to Handle Long Sequences**: Existing methods perform poorly on longer motion sequences and cannot guarantee the quality of the generated motion.

### Proposed Solution

The authors propose a diffusion-based method that uses a Transformer denoiser. The steps are as follows:

1. **Encode Input Motion Frames**: Encode the input motion frames and their positions in the sequence, and feed them, together with the current diffusion step, into an encoder Transformer.
2. **Denoising Process**: Predict the clean motion data with a second encoder Transformer and gradually remove noise.
3. **Repeat Iteration**: Repeat the process a predetermined number of times to obtain a smooth and realistic motion sequence.

### Experimental Results

The authors evaluate the method with quantitative metrics (Fréchet Inception Distance (FID), Diversity, and Multimodality) and with visual assessment of the generated outputs. The experiments show that the method performs well when generating 5-second motion sequences, each consisting of 75 frames at 15 fps.

### Formula Representation

- **Noise Scheduling Formula**:
  \[ \beta_t = \beta_{\text{min}} + t \cdot \frac{\beta_{\text{max}} - \beta_{\text{min}}}{T} \]
  where $\beta_{\text{min}}$ and $\beta_{\text{max}}$ are the minimum and maximum noise levels respectively, and $T$ is the total number of time steps.
- **Forward Diffusion Process**:
  \[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \]
  where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ and $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise.
- **Reverse Denoising Process**:
  \[ x_{t-1} = \hat{x}_t^0 + \sqrt{\tilde{\beta}_t} \cdot \epsilon \]
  where $\hat{x}_t^0$ is the predicted clean motion, $\tilde{\beta}_t$ is the posterior variance, and $\epsilon \sim \mathcal{N}(0, I)$ is standard normal noise.

With this method, the authors address motion stitching and in-betweening and demonstrate its potential for generating high-quality, diverse motion sequences.
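The three formulas above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default values `beta_min=1e-4` and `beta_max=0.02`, and the toy `denoiser` callable are all hypothetical, and the posterior variance uses the standard DDPM definition $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t$, which the summary does not spell out. The reverse step follows the formula exactly as written above, adding posterior-scaled noise to the predicted clean motion.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_min=1e-4, beta_max=0.02):
    """Return [beta_1, ..., beta_T] with beta_t = beta_min + t * (beta_max - beta_min) / T.

    The default beta_min / beta_max values are assumptions, not taken from the paper.
    """
    return [beta_min + t * (beta_max - beta_min) / num_steps
            for t in range(1, num_steps + 1)]


def cumulative_alphas(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T], the running product of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps (t is 1-indexed)."""
    a = alpha_bars[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def posterior_variance(t, betas, alpha_bars):
    """Standard DDPM posterior variance (assumed definition); zero at t = 1."""
    if t == 1:
        return 0.0  # alpha_bar_0 is 1 by convention, so the variance vanishes
    return (1.0 - alpha_bars[t - 2]) / (1.0 - alpha_bars[t - 1]) * betas[t - 1]


def reverse_step(x_t, t, denoiser, betas, alpha_bars, rng):
    """Compute x_{t-1} = x0_hat + sqrt(posterior variance) * eps, as written in the summary."""
    x0_hat = denoiser(x_t, t)  # hypothetical stand-in for the Transformer denoiser
    sigma = math.sqrt(posterior_variance(t, betas, alpha_bars))
    return [v + sigma * rng.gauss(0.0, 1.0) for v in x0_hat]


def sample(denoiser, dim, betas, alpha_bars, rng):
    """Start from pure noise and apply the reverse step for t = T, ..., 1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(betas), 0, -1):
        x = reverse_step(x, t, denoiser, betas, alpha_bars, rng)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    betas = linear_beta_schedule(num_steps=1000)
    alpha_bars = cumulative_alphas(betas)

    # Toy "motion": one scalar per frame for 75 frames (real poses are high-dimensional).
    target = [math.sin(2.0 * math.pi * i / 75) for i in range(75)]
    noisy = forward_diffuse(target, 500, alpha_bars, rng)

    # Oracle denoiser that always predicts the target; a trained network would go here.
    result = sample(lambda x_t, t: target, 75, betas, alpha_bars, rng)
    print(len(noisy), result == target)
```

Because the posterior variance is zero at $t = 1$, the last reverse step returns the denoiser's prediction without added noise, so the toy oracle denoiser recovers its target exactly. In the described method, the denoiser would additionally be conditioned on the encoded input frames and their positions in the sequence.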

Human Motion Synthesis: A Diffusion Approach for Motion Stitching and In-Betweening
