Abstract: Mapping music to dance is a challenging problem that requires spatial and temporal coherence along with continual synchronization with the music's progression. Taking inspiration from large language models, we introduce a 2-step approach for generating dance using a Vector Quantized-Variational Autoencoder (VQ-VAE) to distill motion into primitives and train a Transformer decoder to learn the correct sequencing of these primitives. We also evaluate the importance of music representations by comparing naive music feature extraction using Librosa to deep audio representations generated by state-of-the-art audio compression algorithms. Additionally, we train variations of the motion generator using relative and absolute positional encodings to determine the effect on generated motion quality when generating arbitrarily long sequences. Our proposed approach achieves state-of-the-art results in music-to-motion generation benchmarks and enables the real-time generation of considerably longer motion sequences, the ability to chain multiple motion sequences seamlessly, and easy customization of motion sequences to meet style requirements.

What problem does this paper attempt to address?

This paper addresses the problem of automatically mapping music to dance. Specifically, it aims to generate dance movements that are synchronized with the music and are spatially and temporally coherent. This involves several key challenges:

1. **Synchronization between music and dance**: Dance movements need to follow the rhythm and melody of the music. The generated dance must match the style of the music and also change as the music changes.
2. **Long-sequence generation**: The generated dance movements need to continue over long durations without freezing or unnatural sliding.
3. **Multi-modal fusion**: In addition to music conditioning, the model should be able to generate dances from text style prompts, so that the generated dances both follow the music and reflect a specific dance style.

To address these challenges, the authors propose a two-step framework:

- **Step 1: Use a VQ-VAE (Vector Quantized-Variational Autoencoder) to compress motion data into discrete "motion primitives"**:
  - The VQ-VAE encodes complex motion data into discrete codebook indices, which can be regarded as the basic units of motion.
  - Each encoder output is quantized to its nearest codebook entry: \[ z_q^i = \arg\min_{e_k \in C} \|z_e^i - e_k\|^2 \] where \(z_e^i\) is the \(i\)-th latent feature output by the encoder, \(C\) is the codebook, and \(e_k\) is the \(k\)-th embedding vector in the codebook.
- **Step 2: Train a Transformer decoder to learn the correct ordering of these motion primitives**:
  - The decoder is an autoregressive generative model. It generates a sequence of motion primitives one token at a time, conditioned on the input music features and an optional text style prompt.
  - The optimization objective is: \[ L_{\text{GPT}} = -\sum_{i = 1}^{|S|}\log P_\theta(S_i \mid S_{<i}, c) \] where \(S\) is the sequence of motion primitives, \(c\) is the conditional input (music features and text style), and \(\theta\) denotes the model parameters.

In addition, the authors evaluate different music representations, comparing simple music features extracted with Librosa against the deep audio representation generated by Encodec. The experimental results show that Encodec features perform better when generating longer sequences, and that the resulting dance movements are more diverse and closer to the real data.

In conclusion, this paper tackles the automatic mapping between music and dance by combining a VQ-VAE with a Transformer decoder, and reports state-of-the-art results on music-to-motion generation benchmarks.
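The two formulas in the summary can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names (`quantize`, `log_softmax`, `sequence_nll`), the toy codebook, and the toy logits are all hypothetical. The per-step logits stand in for the output of a decoder already conditioned on the previous tokens and on the music and style input \(c\). For convenience, `quantize` returns the index \(k\) of the nearest embedding rather than the embedding \(e_k\) itself.

```python
import math
from typing import List, Sequence


def quantize(z_e: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index k of the codebook vector e_k closest to z_e,
    i.e. the k that minimizes ||z_e - e_k||^2."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_nll(step_logits: Sequence[Sequence[float]],
                 tokens: Sequence[int]) -> float:
    """Negative log-likelihood -sum_i log P(S_i | S_<i, c).

    step_logits[i] is assumed to already depend on S_<i and on c."""
    return -sum(log_softmax(l)[t] for l, t in zip(step_logits, tokens))


# Toy data (hypothetical): a 3-entry codebook and 3 encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

# Step 1: map each latent to its nearest codebook index (a "motion primitive").
tokens = [quantize(z, codebook) for z in latents]  # [0, 1, 2]

# Step 2: score the token sequence under uniform (uninformative) logits.
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]
nll = sequence_nll(uniform_logits, tokens)  # equals 3 * log(3)
```

With uniform logits every token has probability 1/3, so the loss is \(3\log 3\). A trained decoder would assign higher probability to the observed primitives and lower this value.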

MAGMA: Music Aligned Generative Motion Autodecoder
