AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation

Xinzhou Wang, Yikai Wang, Junliang Ye, Zhengyi Wang, Fuchun Sun, Pengkun Liu, Ling Wang, Kai Sun, Xintong Wang, Bin He
2024-03-28
Abstract: Advances in 3D generation have facilitated sequential 3D model generation (a.k.a. 4D generation), yet its application to animatable objects with large motion remains scarce. Our work proposes AnimatableDreamer, a text-to-4D generation framework capable of generating diverse categories of non-rigid objects on skeletons extracted from a monocular video. At its core, AnimatableDreamer is equipped with our novel optimization design dubbed Canonical Score Distillation (CSD), which lifts 2D diffusion for temporally consistent 4D generation. CSD, designed from a score gradient perspective, generates a canonical model with warp-robustness across different articulations. Notably, it also enhances the authenticity of bones and skinning by integrating inductive priors from a diffusion model. Furthermore, with multi-view distillation, CSD infers invisible regions, thereby improving the fidelity of monocular non-rigid reconstruction. Extensive experiments demonstrate the capability of our method in generating high-flexibility text-guided 3D models from monocular video, while also showing improved reconstruction performance over existing non-rigid reconstruction methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper introduces AnimatableDreamer, a framework for generating and reconstructing non-rigid 3D models, with a focus on animatable objects that undergo large motion. The framework first extracts a skeleton and skinning from a monocular video, then generates text-prompted non-rigid 3D models of diverse categories on that skeleton. Its core contribution is an optimization design called Canonical Score Distillation (CSD). Designed from a score-gradient perspective, CSD lifts a 2D diffusion model to temporally consistent 4D generation by producing a canonical model that stays robust to warping across different articulations. CSD also optimizes the bones and skinning weights, drawing on the inductive priors of the diffusion model so that the model keeps a plausible shape in different poses. Through multi-view distillation, CSD infers regions that are not visible in the input video, which improves the fidelity of monocular non-rigid reconstruction. According to the authors, experiments show that AnimatableDreamer generates highly flexible text-guided 3D models from monocular video and improves reconstruction performance over existing non-rigid reconstruction methods. The approach is relevant to the automatic construction of animatable 3D models in fields such as gaming, virtual reality, and film visual effects.
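To make the terms "canonical model", "bones", and "skinning weights" concrete, here is a minimal sketch of generic linear blend skinning, a common way to warp canonical-space points with per-bone rigid transforms. It is an illustrative assumption only: the summary above does not state which deformation model AnimatableDreamer uses, and the function name `linear_blend_skinning` and its parameters are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def linear_blend_skinning(points, weights, rotations, translations):
    """Warp canonical points into a posed frame (generic LBS, not the paper's code).

    points:       list of N canonical-space positions [x, y, z]
    weights:      list of N rows, each with B skinning weights summing to 1
    rotations:    list of B 3x3 rotation matrices, one per bone
    translations: list of B translation vectors, one per bone
    """
    warped = []
    for p, w in zip(points, weights):
        out = [0.0, 0.0, 0.0]
        for w_b, rot, trans in zip(w, rotations, translations):
            moved = mat_vec(rot, p)  # rotate the point with this bone
            for i in range(3):
                # accumulate the bone's transformed position, scaled by its weight
                out[i] += w_b * (moved[i] + trans[i])
        warped.append(out)
    return warped


# Usage: two bones, one fixed and one translated by 2 units along x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
posed = linear_blend_skinning(
    points=[[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]],
    weights=[[1.0, 0.0], [0.5, 0.5]],
    rotations=[identity, identity],
    translations=[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
)
```

In this sketch the first point follows only the fixed bone and stays in place, while the second is influenced equally by both bones and moves half of the second bone's translation. Poorly chosen weights would distort the surface under large articulations, which is the kind of failure that optimizing bones and skinning is meant to reduce.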