Abstract: In this paper, we introduce a MusIc conditioned 3D Dance GEneraTion model, named MIDGET, based on a Dance motion Vector Quantised Variational AutoEncoder (VQ-VAE) model and a Motion Generative Pre-Training (GPT) model, to generate vibrant and high-quality dances that match the music rhythm. To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook based on the Motion VQ-VAE model to store different human pose codes, 2) a Motion GPT model that generates pose codes with music and motion encoders, 3) a simple framework for music feature extraction. We compare with existing state-of-the-art models and perform ablation experiments on AIST++, the largest publicly available music-dance dataset. Experiments demonstrate that our proposed framework achieves state-of-the-art performance on motion quality and its alignment with the music.
Sound, Computer Vision and Pattern Recognition, Graphics, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the generation of high-quality 3D dance movements that match the rhythm of music. Specifically, the authors propose a model named MIDGET (MusIc conditioned 3D Dance GEneraTion) to address weaknesses in how existing methods keep music and dance movements consistent, such as movement freezing and inaccurate alignment with music beats.
### Main Problems and Challenges
1. **Consistency between Music Beats and Movements**:
- Existing methods such as EDGE [26] and Bailando [24] have difficulty keeping generated movements consistent with the music beats, and the generated movements may freeze.
- Music feature extraction and movement generation are hard to align, especially when synchronizing music beats with movement beats.
2. **Generating High-Quality and Diverse Dance Movements**:
- The generated dance movements should be consistent with the music rhythm and also be of high quality and diverse, avoiding monotonous and repetitive movements.
3. **Efficient Music Feature Extraction**:
- Directly down-sampling music features performs poorly, so a more effective music feature extraction method is needed to capture subtle changes in the music.
### Solutions
To solve the above problems, the MIDGET model introduces the following innovations:
1. **Gradient Copying Strategy**:
- Through the gradient copying strategy, the motion generator is directly trained to minimize the music alignment loss, thereby ensuring that the generated movements are highly consistent with the music beats.
2. **Simple Music Feature Extractor**:
- A simple but effective music feature extractor is proposed. It captures music features better and has a small number of parameters, which improves the efficiency of the model.
3. **Memory Codebook Based on VQ-VAE and Motion GPT**:
- The VQ-VAE model quantizes dance movements into discrete codebook representations, and the Motion GPT model generates movement sequences that match the music conditions.
- This method improves the quality of the generated movements and also addresses the movement freezing problem in long-sequence generation.
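
The codebook idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional "pose features", the three-entry codebook, and the function names `quantize` and `dequantize` are all hypothetical and chosen only for illustration. It shows the nearest-neighbour lookup that turns continuous pose features into discrete code indices, which is the kind of token sequence a GPT-style model could then predict.

```python
from math import dist

def quantize(poses, codebook):
    """Map each pose feature vector to the index of its nearest codebook entry."""
    indices = []
    for pose in poses:
        # Nearest neighbour under Euclidean distance (toy stand-in for a learned encoder + codebook)
        best = min(range(len(codebook)), key=lambda k: dist(pose, codebook[k]))
        indices.append(best)
    return indices

def dequantize(indices, codebook):
    """Look the code indices back up to recover the quantized pose features."""
    return [codebook[k] for k in indices]

# Hypothetical 2-D codebook and pose features, for illustration only
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
poses = [(0.9, 0.1), (0.1, 0.2), (0.2, 0.8)]

codes = quantize(poses, codebook)
reconstructed = dequantize(codes, codebook)
```

The lookup itself is not differentiable. In VQ-VAE training this is commonly handled by copying the gradient from the quantized output back to the encoder output; how MIDGET's gradient copying strategy is wired in detail is not specified in this summary.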
### Experimental Results
The paper reports experiments on the AIST++ dataset. The results show that MIDGET outperforms existing state-of-the-art methods on several evaluation metrics, notably FID (Fréchet Inception Distance), Diversity, and Beat Align Score.
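
As a rough illustration of what a beat alignment metric measures, the sketch below scores how close each music beat is to its nearest motion beat using a Gaussian falloff. This is one common form of the score in AIST++-based evaluations, not necessarily the exact variant used for MIDGET: published definitions differ on whether the average runs over music beats or motion beats, and the value `sigma=3.0` and the function name `beat_align_score` are assumptions made here.

```python
import math

def beat_align_score(music_beats, motion_beats, sigma=3.0):
    """Average Gaussian closeness of each music beat to its nearest motion beat.

    Beat times are in frames; sigma (an assumed value) controls the tolerance.
    """
    if not music_beats or not motion_beats:
        return 0.0
    total = 0.0
    for tm in music_beats:
        # Distance from this music beat to the closest motion beat
        nearest = min(abs(tm - td) for td in motion_beats)
        total += math.exp(-(nearest ** 2) / (2 * sigma ** 2))
    return total / len(music_beats)

# Hypothetical beat times, for illustration only
aligned = beat_align_score([10, 20, 30], [10, 20, 30])
shifted = beat_align_score([10, 20, 30], [14, 24, 34])
```

A score of 1.0 means every music beat coincides with a motion beat; the score decays towards 0 as the beats drift apart.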
In conclusion, the proposed MIDGET model addresses the consistency between music beats and movements in music-conditioned 3D dance generation, and generates high-quality and diverse dance movements.