DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction

Li Zhao, Zhengmin Lu
2024-11-07
Abstract: This paper introduces DanceFusion, a novel framework for reconstructing and generating dance movements synchronized to music, utilizing a Spatio-Temporal Skeleton Diffusion Transformer. The framework adeptly handles incomplete and noisy skeletal data common in short-form dance videos on social media platforms like TikTok. DanceFusion incorporates a hierarchical Transformer-based Variational Autoencoder (VAE) integrated with a diffusion model, significantly enhancing motion realism and accuracy. Our approach introduces sophisticated masking techniques and a unique iterative diffusion process that refines the motion sequences, ensuring high fidelity in both motion generation and synchronization with accompanying audio cues. Comprehensive evaluations demonstrate that DanceFusion surpasses existing methods, providing state-of-the-art performance in generating dynamic, realistic, and stylistically diverse dance motions. Potential applications of this framework extend to content creation, virtual reality, and interactive entertainment, promising substantial advancements in automated dance generation. Visit our project page at https://th-mlab.github.io/DanceFusion/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper attempts to solve is how to reconstruct and generate high-quality dance movements synchronized with music from short-form dance videos on social media platforms (such as TikTok), especially when these videos contain incomplete and noisy data. Specifically, the paper proposes the DanceFusion framework, which aims at:

1. **Handling Incomplete and Noisy Data**: User-generated dance videos on social media platforms vary in quality and suffer from problems such as background noise, partial occlusion, and low resolution. These problems make it difficult for traditional computer vision and pose estimation models to process the data effectively. DanceFusion handles such incomplete and noisy data more accurately by introducing sophisticated masking techniques and a unique iterative diffusion process.
2. **Generating Dance Movements Synchronized with Music**: Dance movements need to be not only visually realistic but also precisely synchronized with the rhythm of the background music. DanceFusion integrates a variational autoencoder (VAE) with a diffusion model, which lets it generate high-fidelity dance movements that stay closely synchronized with audio cues.
3. **Improving the Realism and Accuracy of Movements**: DanceFusion uses a Spatio-Temporal Skeleton Diffusion Transformer to enhance the realism and accuracy of movements. Through multi-level spatio-temporal encoding, the model captures both the spatial configuration of joints and the temporal dynamics of movement, thereby generating more natural and realistic dance movements.

### Main Research Objectives

1. **Hierarchical Spatio-Temporal VAE for Movement Reconstruction**: Develop a Transformer-based variational autoencoder that can accurately reconstruct incomplete TikTok dance movements and maintain high precision even when data are missing or very noisy.
2. **Integrated Diffusion Model for Audio-Driven Movement Generation**: Integrate the diffusion model into the VAE framework to generate high-fidelity, audio-synchronized dance movements.
3. **Evaluating the Effectiveness of the Framework**: Use a diverse dataset of TikTok dance sequences to evaluate the framework's performance in movement reconstruction and generation, and verify its effectiveness in real-world scenarios.

### Contributions

1. **Hierarchical Spatio-Temporal VAE**: Introduce a Transformer-based variational autoencoder that captures the spatial configuration of joints and their temporal dynamics, thereby reconstructing incomplete and noisy data more accurately.
2. **Integration of Diffusion Models**: By incorporating a diffusion model, DanceFusion iteratively refines movement sequences, enhancing the realism of movements and keeping them synchronized with the audio input.
3. **Advanced Masking Techniques**: Develop masking strategies to manage missing or unreliable joint data, enabling the model to prioritize available information and improve reconstruction accuracy.

In conclusion, the DanceFusion framework aims to solve the data quality problems of dance videos on social media platforms and to generate high-quality, music-synchronized dance movements, opening new possibilities for content creation, virtual reality, and interactive entertainment.
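The summary does not give the paper's actual masking formulation, so the following is only a minimal sketch of the general idea of prioritizing available joint data: a reconstruction error computed over observed joints alone. The function name `masked_mse`, the array shapes, and the convention that a mask value of 1 marks an observed joint are all assumptions made for illustration, not details taken from DanceFusion.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error over observed joints only (illustrative sketch).

    pred, target: arrays of shape (frames, joints, dims)
    mask: array of shape (frames, joints); nonzero = joint was observed
    (hypothetical convention, not taken from the paper).
    """
    observed = np.asarray(mask).astype(bool)
    # Squared error per joint, summed over the coordinate dimensions.
    sq_err = ((np.asarray(pred) - np.asarray(target)) ** 2).sum(axis=-1)
    # Zero out unobserved joints so that missing values (even NaN) are ignored.
    sq_err = np.where(observed, sq_err, 0.0)
    n_observed = observed.sum()
    if n_observed == 0:
        return 0.0
    return float(sq_err.sum() / n_observed)

# Two frames, three joints, 2D coordinates; three joints are marked unobserved.
target = np.zeros((2, 3, 2))
pred = np.ones((2, 3, 2))
mask = np.array([[1, 1, 0], [1, 0, 0]])
pred[0, 2] = 100.0  # large error at an unobserved joint has no effect
print(masked_mse(pred, target, mask))  # 2.0
```

Averaging over the number of observed joints, rather than over all joints, keeps the value comparable between sequences with different amounts of missing data.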