Abstract: This paper introduces DanceFusion, a novel framework for reconstructing and generating dance movements synchronized to music, utilizing a Spatio-Temporal Skeleton Diffusion Transformer. The framework adeptly handles incomplete and noisy skeletal data common in short-form dance videos on social media platforms like TikTok. DanceFusion incorporates a hierarchical Transformer-based Variational Autoencoder (VAE) integrated with a diffusion model, significantly enhancing motion realism and accuracy. Our approach introduces sophisticated masking techniques and a unique iterative diffusion process that refines the motion sequences, ensuring high fidelity in both motion generation and synchronization with accompanying audio cues. Comprehensive evaluations demonstrate that DanceFusion surpasses existing methods, providing state-of-the-art performance in generating dynamic, realistic, and stylistically diverse dance motions. Potential applications of this framework extend to content creation, virtual reality, and interactive entertainment, promising substantial advancements in automated dance generation. Visit our project page at https://th-mlab.github.io/DanceFusion/.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is how to reconstruct and generate high-quality dance movements synchronized with music from short-form dance videos on social media platforms such as TikTok, where the data is usually incomplete and noisy. Specifically, the paper proposes the DanceFusion framework, which aims at:

1. **Handling Incomplete and Noisy Data**: User-generated dance videos on social media platforms vary in quality and suffer from background noise, partial occlusion, and low resolution. These problems make it difficult for traditional computer vision and pose estimation models to process the data effectively. DanceFusion handles incomplete and noisy data more accurately by introducing sophisticated masking techniques and a unique iterative diffusion process.
2. **Generating Dance Movements Synchronized with Music**: Dance movements need to be not only visually realistic but also precisely synchronized with the rhythm of the background music. DanceFusion integrates a variational autoencoder (VAE) with a diffusion model to generate high-fidelity dance movements that stay closely synchronized with audio cues.
3. **Improving the Realism and Accuracy of Movements**: DanceFusion uses a spatio-temporal skeleton diffusion Transformer to enhance the realism and accuracy of movements. Through multi-level spatio-temporal encoding, the model captures both the spatial configuration of joints and the temporal dynamics of movement, thereby generating more natural and realistic dance movements.

### Main Research Objectives

1. **Hierarchical Spatio-Temporal VAE for Movement Reconstruction**: Develop a Transformer-based variational autoencoder that can accurately reconstruct incomplete TikTok dance movements and maintain high precision even with missing data or high noise.
2. **Integrated Diffusion Model for Audio-Driven Movement Generation**: Integrate the diffusion model into the VAE framework to generate high-fidelity, audio-synchronized dance movements.
3. **Evaluating the Effectiveness of the Framework**: Use a diverse dataset of TikTok dance sequences to evaluate the framework's performance in movement reconstruction and generation, and verify its effectiveness in real-world scenarios.

### Contributions

1. **Hierarchical Spatio-Temporal VAE**: Introduce a Transformer-based variational autoencoder that effectively captures the spatial configuration of joints and their temporal dynamics, and so reconstructs incomplete and noisy data more accurately.
2. **Integration of Diffusion Models**: By incorporating a diffusion model, DanceFusion iteratively refines movement sequences, enhancing the realism of movements and ensuring precise synchronization with the audio input.
3. **Advanced Masking Techniques**: Develop sophisticated masking strategies to manage missing or unreliable joint data, enabling the model to prioritize available information and improve reconstruction accuracy.

In conclusion, the DanceFusion framework aims to solve the data quality problems of dance videos on social media platforms and to generate high-quality, music-synchronized dance movements, opening new possibilities for content creation, virtual reality, and interactive entertainment.
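
The idea of masking missing or unreliable joints can be illustrated with a small sketch. This is not the paper's implementation: the function name `masked_mse`, the 2D joint representation, and the `[frame][joint]` layout are all assumptions chosen for illustration. It only shows the general principle of scoring a reconstruction on the joints that were actually observed, so that missing joints do not distort the error.

```python
from typing import Sequence, Tuple

# (x, y) position of one joint; using 2D coordinates is an assumption.
Joint = Tuple[float, float]


def masked_mse(
    predicted: Sequence[Sequence[Joint]],
    target: Sequence[Sequence[Joint]],
    visible: Sequence[Sequence[bool]],
) -> float:
    """Mean squared distance per joint, over joints whose mask is True.

    All three inputs are indexed as [frame][joint]. Joints marked False
    (missing or unreliable) contribute nothing to the error.
    """
    total = 0.0
    count = 0
    for pred_frame, tgt_frame, mask_frame in zip(predicted, target, visible):
        for (px, py), (tx, ty), keep in zip(pred_frame, tgt_frame, mask_frame):
            if keep:
                total += (px - tx) ** 2 + (py - ty) ** 2
                count += 1
    # If every joint is masked out there is nothing to score.
    return total / count if count else 0.0


if __name__ == "__main__":
    # Two frames, two joints each; the last joint is marked as missing.
    target = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (9.0, 9.0)]]
    predicted = [[(0.0, 1.0), (1.0, 1.0)], [(0.0, 0.0), (2.0, 2.0)]]
    visible = [[True, True], [True, False]]
    print(masked_mse(predicted, target, visible))
```

In the example, the large disagreement on the final joint is ignored because its mask entry is `False`, so only the three visible joints determine the score.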

DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction

Dance Any Beat: Blending Beats with Visuals in Dance Video Generation

DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation

DanceFormer: Music Conditioned 3D Dance Generation with Parametric Motion Transformer

DanceIt: Music-Inspired Dancing Video Synthesis

Learning to Generate Diverse Dance Motions with Transformer

Dance2Music-Diffusion: leveraging latent diffusion models for music generation from dance videos

LongDanceDiff: Long-term Dance Generation with Conditional Diffusion Model

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

Robust Dancer: Long-term 3D Dance Synthesis Using Unpaired Data

DisCo: Disentangled Control for Realistic Human Dance Generation

Towards 3D Dance Motion Synthesis and Control

DanceCamAnimator: Keyframe-Based Controllable 3D Dance Camera Synthesis

DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses

Music2Dance: DanceNet for Music-Driven Dance Generation

DanceAnyWay: Synthesizing Beat-Guided 3D Dances with Randomized Temporal Contrastive Learning

DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

Image Comes Dancing With Collaborative Parsing-Flow Video Synthesis

DanceMeld: Unraveling Dance Phrases with Hierarchical Latent Codes for Music-to-Dance Synthesis