LongDanceDiff: Long-term Dance Generation with Conditional Diffusion Model

Siqi Yang, Zejun Yang, Zhisheng Wang

2023-08-23

Abstract: Dancing to music has always been an essential human art form for expressing emotion. Due to its high spatio-temporal complexity, generating long-term, realistic 3D dance synchronized with music is challenging. Existing methods suffer from the freezing problem when generating long-term dances, owing to error accumulation and a training-inference discrepancy. To address this, we design a conditional diffusion model, LongDanceDiff, for sequence-to-sequence long-term dance generation that tackles the challenges of temporal coherency and spatial constraints. LongDanceDiff contains a transformer-based diffusion model whose input is a concatenation of music, past motions, and noised future motions. This partial noising strategy leverages the full-attention mechanism and learns the dependencies among music and past motions. To enhance the diversity of generated dance motions and mitigate the freezing problem, we introduce a mutual information minimization objective that regularizes the dependency between past and future motions. We also address common visual quality issues in dance generation, such as foot sliding and unsmooth motion, by incorporating spatial constraints through a Global-Trajectory Modulation (GTM) layer and motion perceptual losses, thereby improving the smoothness and naturalness of the generated motion. Extensive experiments demonstrate a significant improvement of our approach over existing state-of-the-art methods. We plan to release our code and models soon.
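The partial noising strategy described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cosine noise schedule, the function names `cosine_alpha_bar` and `partial_noise`, and the assumption that music and motion features have already been projected to a common width are all hypothetical choices made for illustration. The only idea taken from the abstract is that noise is applied to the future motions alone, while music and past motions enter the sequence clean.

```python
import numpy as np

def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal fraction at step t under a cosine schedule (assumed here)."""
    f = lambda u: np.cos((u / num_steps + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def partial_noise(music, past, future, t, num_steps, rng):
    """Noise only the future motion tokens, then concatenate along time.

    music, past, future: arrays of shape (T_*, d) with a shared width d
    (hypothetical: a real model would embed each modality first).
    Returns the token sequence and a boolean mask marking noised tokens.
    """
    alpha_bar = cosine_alpha_bar(t, num_steps)
    eps = rng.standard_normal(future.shape)
    # Standard forward diffusion step, applied to the future segment only.
    noised_future = np.sqrt(alpha_bar) * future + np.sqrt(1.0 - alpha_bar) * eps
    tokens = np.concatenate([music, past, noised_future], axis=0)
    mask = np.zeros(tokens.shape[0], dtype=bool)
    mask[music.shape[0] + past.shape[0]:] = True
    return tokens, mask

rng = np.random.default_rng(0)
music = rng.standard_normal((8, 4))   # 8 music frames
past = rng.standard_normal((5, 4))    # 5 past motion frames
future = rng.standard_normal((6, 4))  # 6 future motion frames to be denoised
tokens, mask = partial_noise(music, past, future, t=500, num_steps=1000, rng=rng)
```

Because the conditioning tokens stay clean and share one sequence with the noised tokens, a full-attention transformer can attend from every future frame to all music and past-motion frames.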

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses long-term 3D dance generation synchronized with music. Existing methods suffer from freezing (caused by error accumulation and a training-inference discrepancy) and from abrupt transitions when generating longer dance sequences. To solve these problems, the paper proposes LongDanceDiff, a Transformer-based conditional diffusion model. It improves dance generation in three ways:

1. **Conditional Diffusion Model**: Uses music and past motions as conditions to generate future motions, producing motion sequences that are continuous and synchronized with the music.
2. **Mutual Information Minimization Objective**: Regularizes the dependency between past and future motions, reducing over-reliance on past motions and improving the diversity of generated sequences.
3. **Spatial Constraints**: Addresses foot sliding by introducing a Global-Trajectory Modulation (GTM) layer, and improves the smoothness and naturalness of motion through motion perceptual losses.

With these improvements, the paper reports that LongDanceDiff surpasses existing methods in generating high-quality, diverse long-term dance sequences, and that it outperforms the baseline methods in visual quality and music synchronization.
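To make the foot-sliding issue concrete, the sketch below computes one common way of quantifying it: the mean horizontal speed of foot joints during frames where they are close to the ground. This is an illustrative measure only, not the paper's GTM layer or its motion perceptual losses; the function name `foot_sliding_penalty`, the height threshold, and the choice of the y axis as "up" are assumptions.

```python
import numpy as np

def foot_sliding_penalty(foot_pos, height_thresh=0.05, up_axis=1):
    """Mean horizontal foot speed over frames where the foot is near the ground.

    foot_pos: array of shape (T, J, 3) with positions of J foot joints over
    T frames. The threshold and up axis are hypothetical defaults.
    """
    vel = foot_pos[1:] - foot_pos[:-1]            # per-frame displacement, (T-1, J, 3)
    horiz = np.delete(vel, up_axis, axis=-1)      # drop the vertical component
    speed = np.linalg.norm(horiz, axis=-1)        # horizontal speed, (T-1, J)
    contact = foot_pos[:-1, :, up_axis] < height_thresh  # near-ground frames
    if not contact.any():
        return 0.0
    return float(speed[contact].mean())

# A foot that stays on the ground but drifts 0.1 units per frame along x.
sliding = np.zeros((5, 1, 3))
sliding[:, 0, 0] = np.arange(5) * 0.1
penalty = foot_sliding_penalty(sliding)
```

A planted foot that does not move scores zero, while a foot that translates while in contact with the ground scores in proportion to how fast it drifts, which is the artifact viewers perceive as sliding.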

DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation

Bidirectional Autoregressive Diffusion Model for Dance Generation

Dance2Music-Diffusion: leveraging latent diffusion models for music generation from dance videos

Music2Dance: DanceNet for Music-Driven Dance Generation

Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning

DanceFormer: Music Conditioned 3D Dance Generation with Parametric Motion Transformer

Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives

Robust Dancer: Long-term 3D Dance Synthesis Using Unpaired Data

Towards 3D Dance Motion Synthesis and Control

EnchantDance: Unveiling the Potential of Music-Driven Dance Movement

Example-Based Automatic Music-Driven Conventional Dance Motion Synthesis

Flexible Music-Conditioned Dance Generation with Style Description Prompts

Genre-Conditioned Long-Term 3D Dance Generation Driven by Music

Dance Any Beat: Blending Beats with Visuals in Dance Video Generation

DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction

DanceMeld: Unraveling Dance Phrases with Hierarchical Latent Codes for Music-to-Dance Synthesis

Dance Your Latents: Consistent Dance Generation through Spatial-temporal Subspace Attention Guided by Motion Flow

Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation