MAGMA: Music Aligned Generative Motion Autodecoder

Sohan Anisetty, Amit Raj, James Hays
2023-09-03
Abstract: Mapping music to dance is a challenging problem that requires spatial and temporal coherence along with a continual synchronization with the music's progression. Taking inspiration from large language models, we introduce a 2-step approach for generating dance using a Vector Quantized-Variational Autoencoder (VQ-VAE) to distill motion into primitives and train a Transformer decoder to learn the correct sequencing of these primitives. We also evaluate the importance of music representations by comparing naive music feature extraction using Librosa to deep audio representations generated by state-of-the-art audio compression algorithms. Additionally, we train variations of the motion generator using relative and absolute positional encodings to determine the effect on generated motion quality when generating arbitrarily long sequence lengths. Our proposed approach achieves state-of-the-art results in music-to-motion generation benchmarks and enables the real-time generation of considerably longer motion sequences, the ability to chain multiple motion sequences seamlessly, and easy customization of motion sequences to meet style requirements.
Graphics, Computer Vision and Pattern Recognition, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the automatic mapping of music to dance: generating dance movements that are synchronized with the music and are spatially and temporally coherent. This involves several key challenges:

1. **Synchronization between music and dance**: Dance movements need to follow the rhythm and melody of the music. The generated dance must match the style of the music and also change as the music changes.
2. **Long-sequence generation**: The generated dance must be sustainable over long durations without freezing or unnatural sliding.
3. **Multi-modal fusion**: Beyond music conditioning, the model should also generate dances from text style prompts, so that the result both matches the music and reflects a specific dance style.

To solve these problems, the authors propose a two-step framework:

- **Step 1: Use a VQ-VAE (Vector Quantized-Variational Autoencoder) to compress motion data into discrete "motion primitives"**:
  - The VQ-VAE encodes complex motion data into discrete codebook indices, which can be regarded as the basic units of motion.
  - Each encoder output is quantized to its nearest codebook entry: \[ z_q^i=\arg\min_{e_k \in C}\|z_e^i - e_k\|^2 \] where \(z_e^i\) is the \(i\)-th latent feature output by the encoder, \(z_q^i\) is its quantized counterpart, \(C\) is the codebook, and \(e_k\) is the \(k\)-th embedding vector in the codebook.
- **Step 2: Train a Transformer decoder to learn the correct ordering of these motion primitives**:
  - The decoder is an autoregressive generative model. It generates a sequence of motion primitives one token at a time, conditioned on the input music features and optional text style prompts.
  - The optimization objective is the negative log-likelihood of the primitive sequence: \[ L_{\text{GPT}} = -\sum_{i = 1}^{|S|}\log P_\theta(S_i|S_{< i}, c) \] where \(S\) is the sequence of motion primitives, \(S_{<i}\) denotes the primitives preceding position \(i\), \(c\) is the conditional input (music features and text style), and \(\theta\) denotes the model parameters.

In addition, the authors evaluate different music representations, comparing simple music features extracted with Librosa against the deep audio representation generated by Encodec. The experimental results show that the Encodec features perform better when generating longer sequences, and that the resulting dance movements are more diverse and closer to the real data.

In conclusion, this paper addresses the automatic mapping between music and dance by combining a VQ-VAE with a Transformer decoder, and reports state-of-the-art results on music-to-motion generation benchmarks.
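
The codebook lookup in Step 1 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `quantize`, the toy two-dimensional codebook, and the array shapes are assumptions chosen only to make the nearest-neighbour rule concrete.

```python
import numpy as np

def quantize(z_e: np.ndarray, codebook: np.ndarray):
    """Map each encoder latent to its nearest codebook vector.

    z_e:      (T, D) array of encoder outputs z_e^i.
    codebook: (K, D) array of embedding vectors e_k.
    Returns (indices, z_q) with shapes (T,) and (T, D).
    """
    # Squared Euclidean distance between every latent and every entry: (T, K)
    diff = z_e[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # Index of the closest codebook entry for each latent
    indices = np.argmin(dist, axis=1)
    # Quantized latents are the selected embedding vectors
    z_q = codebook[indices]
    return indices, z_q

# Hypothetical toy codebook with K=4 entries in D=2 dimensions
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z_e = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])
indices, z_q = quantize(z_e, codebook)
print(indices)  # [1 2 0]
```

The integer `indices` play the role of the discrete motion primitives that the second stage learns to sequence.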
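
The Step 2 objective \(L_{\text{GPT}}\) can likewise be sketched in NumPy. The sketch assumes the decoder has already produced one row of logits per position, where row \(i\) stands for the prediction of \(S_i\) given \(S_{<i}\) and \(c\). The helper names `log_softmax` and `gpt_nll` are hypothetical, and no Transformer is modelled here.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def gpt_nll(logits: np.ndarray, tokens: np.ndarray) -> float:
    """Negative log-likelihood of a primitive sequence.

    logits: (|S|, K) array; row i scores every codebook index at position i.
    tokens: (|S|,) integer array of ground-truth codebook indices S_i.
    """
    log_probs = log_softmax(logits)
    # Pick log P(S_i | S_<i, c) for the true token at each position
    picked = log_probs[np.arange(len(tokens)), tokens]
    return float(-picked.sum())

# Hypothetical example: uniform logits over K=4 codes, so each token
# has probability 1/4 and the loss is 3 * log(4).
logits = np.zeros((3, 4))
tokens = np.array([2, 0, 3])
loss = gpt_nll(logits, tokens)
```

With uniform predictions the loss equals \(|S| \log K\); training lowers it by putting more probability on the observed primitives.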