Abstract: In this paper, we introduce a MusIc conditioned 3D Dance GEneraTion model, named MIDGET, based on a Dance motion Vector Quantised Variational AutoEncoder (VQ-VAE) model and a Motion Generative Pre-Training (GPT) model, to generate vibrant and high-quality dances that match the music rhythm. To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook based on the Motion VQ-VAE model to store different human pose codes, 2) a Motion GPT model that generates pose codes with music and motion encoders, and 3) a simple framework for music feature extraction. We compare with existing state-of-the-art models and perform ablation experiments on AIST++, the largest publicly available music-dance dataset. Experiments demonstrate that our proposed framework achieves state-of-the-art performance on motion quality and its alignment with the music.
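
To make the idea of a codebook of pose codes concrete, here is a minimal sketch of the nearest-neighbour lookup that vector quantisation generally relies on. It is not the authors' implementation: the function name `quantize`, the array shapes, and the toy values are assumptions chosen for illustration only.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray):
    """Map each feature vector to its nearest codebook entry.

    features: (T, D) array, one row per time step (hypothetical encoder output).
    codebook: (K, D) array of code vectors (hypothetical pose codes).
    Returns (indices, quantized) with shapes (T,) and (T, D).
    """
    # Squared Euclidean distance between every feature and every code: (T, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest code for each time step.
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


# Toy example: three 2-D codes and three feature vectors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
features = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
indices, quantized = quantize(features, codebook)
print(indices)  # [0 1 2]
```

In this view, a sequence model such as a GPT would predict the discrete `indices`, and a decoder would turn the looked-up code vectors back into poses.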

What problem does this paper attempt to address?

This paper addresses the generation of high-quality 3D dance movements that match the rhythm of the music. Specifically, the authors propose a model named MIDGET (MusIc conditioned 3D Dance GEneraTion) that targets weaknesses of existing methods in keeping music and dance movements consistent, such as movement freezing and inaccurate alignment with music beats.

### Main Problems and Challenges

1. **Consistency between Music Beats and Movements**:
   - Existing methods such as EDGE [26] and Bailando [24] have difficulty keeping movements consistent with the music beats when generating dances, so the generated movements may freeze.
   - Music feature extraction and movement generation must be aligned, in particular the synchronization between music beats and movement beats.
2. **Generating High-Quality and Diverse Dance Movements**:
   - The generated dance movements should not only follow the music rhythm but also be high-quality and diverse, avoiding monotonous and repetitive movements.
3. **Efficient Music Feature Extraction**:
   - Directly down-sampling music features performs poorly, so a more effective music feature extraction method is needed to capture subtle changes in the music.

### Solutions

To address these problems, the MIDGET model introduces the following innovations:

1. **Gradient Copying Strategy**:
   - With the gradient copying strategy, the motion generator is trained directly to minimize the music alignment loss, so that the generated movements stay closely aligned with the music beats.
2. **Simple Music Feature Extractor**:
   - A simple but effective music feature extractor is proposed. It captures music features better and has a small number of parameters, which improves the efficiency of the model.
3. **Memory Codebook Based on VQ-VAE and Motion GPT**:
   - The VQ-VAE model quantizes dance movements into discrete codebook representations, and the Motion GPT model generates movement sequences that match the music conditions.
   - This approach not only improves the quality of the generated movements but also addresses the movement freezing problem in long-sequence generation.

### Experimental Results

The paper reports experiments on the AIST++ dataset. The results show that MIDGET outperforms existing state-of-the-art methods on multiple evaluation metrics, notably FID (Fréchet Inception Distance), Diversity, and Beat Align Score.

In conclusion, by proposing the MIDGET model, the paper addresses the consistency between music beats and movements in music-conditioned 3D dance generation and generates high-quality, diverse dance movements.
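
As an illustration of how alignment between music beats and movement beats can be scored, the sketch below follows one formulation commonly used in the music-to-dance literature: for each music beat, take the distance to the nearest motion beat and pass it through a Gaussian kernel, then average. The summary above does not give the exact definition used in the paper, so treat everything here as an assumption: the function name `beat_align_score`, the default `sigma` of 3 frames, and the choice to iterate over music beats rather than motion beats. Beat detection itself (for example, from audio onsets or from minima of joint velocity) is not shown.

```python
import math

import numpy as np


def beat_align_score(music_beats, motion_beats, sigma=3.0):
    """Average Gaussian-weighted closeness of each music beat to a motion beat.

    music_beats, motion_beats: beat positions in frames (assumed inputs).
    sigma: kernel width in frames (assumed value, not taken from the paper).
    Returns a float in [0, 1]; 1.0 means every music beat has a motion beat
    on exactly the same frame.
    """
    music = np.asarray(music_beats, dtype=float)
    motion = np.asarray(motion_beats, dtype=float)
    if music.size == 0 or motion.size == 0:
        return 0.0
    # Distance from each music beat to its nearest motion beat.
    gaps = np.abs(music[:, None] - motion[None, :]).min(axis=1)
    return float(np.mean(np.exp(-(gaps ** 2) / (2.0 * sigma ** 2))))


# One beat lands exactly, the other is 3 frames late.
score = beat_align_score([10, 20], [10, 23])
expected = (1.0 + math.exp(-0.5)) / 2.0
print(round(score, 4), round(expected, 4))
```

A score of this kind rewards motion beats that fall near music beats and decays smoothly as they drift apart, which is why it is often reported alongside FID and Diversity.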

MIDGET: Music Conditioned 3D Dance Generation

DanceFormer: Music Conditioned 3D Dance Generation with Parametric Motion Transformer

QEAN: quaternion-enhanced attention network for visual dance generation

Genre-Conditioned Long-Term 3D Dance Generation Driven by Music

TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration

DanceMeld: Unraveling Dance Phrases with Hierarchical Latent Codes for Music-to-Dance Synthesis

MAGMA: Music Aligned Generative Motion Autodecoder

A deep learning model of dance generation for young children based on music rhythm and beat

Example-Based Automatic Music-Driven Conventional Dance Motion Synthesis

DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation

LongDanceDiff: Long-term Dance Generation with Conditional Diffusion Model

Dance2MIDI: Dance-driven multi-instrument music generation

Music2Dance: DanceNet for Music-Driven Dance Generation

Quantized GAN for Complex Music Generation from Dance Videos

Bailando++: 3D Dance GPT With Choreographic Memory

Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory

Bidirectional Autoregressive Diffusion Model for Dance Generation