STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting

Zenghao Chai, Chen Tang, Yongkang Wong, Mohan Kankanhalli
2024-06-07
Abstract: The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an optimization-by-animation paradigm has several drawbacks. (1) For pose-agnostic optimization, the rendered images in canonical pose for naive Score Distillation Sampling (SDS) exhibit domain gap and cannot preserve view-consistency using only T2I priors, and (2) for post hoc animation, simply applying the source motions to target 3D avatars yields translation artifacts and misalignment. To address these issues, we propose Skeleton-aware Text-based 4D Avatar generation with in-network motion Retargeting (STAR). STAR considers the geometry and skeleton differences between the template mesh and target avatar, and corrects the mismatched source motion by resorting to the pretrained motion retargeting techniques. With the informatively retargeted and occlusion-aware skeleton, we embrace the skeleton-conditioned T2I and text-to-video (T2V) priors, and propose a hybrid SDS module to coherently provide multi-view and frame-consistent supervision signals. Hence, STAR can progressively optimize the geometry, texture, and motion in an end-to-end manner. The quantitative and qualitative experiments demonstrate that our proposed STAR can synthesize high-quality 4D avatars with vivid animations that align well with the text description. Additional ablation studies show the contributions of each component in STAR. The source code and demos are available at: https://star-avatar.github.io.
Computer Vision and Pattern Recognition, Graphics, Multimedia
What problem does this paper attempt to address?
This paper addresses text-based generation of 4D avatars, i.e., animated 3D avatars. Existing methods typically use text-to-image (T2I) diffusion models to synthesize a 3D avatar in the canonical space from a text description and then animate it with target motions. This approach has two drawbacks: (1) in pose-agnostic optimization, the images rendered in the canonical pose for naive Score Distillation Sampling (SDS) exhibit a domain gap, and T2I priors alone cannot preserve view consistency; (2) in post hoc animation, directly applying the source motions to the target 3D avatar produces translation artifacts and misalignment.

To address these issues, the paper proposes STAR (Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting). STAR accounts for the geometric and skeletal differences between the template mesh and the target avatar, and uses pretrained motion retargeting techniques to correct the mismatched source motion. With the retargeted, occlusion-aware skeleton, STAR draws on skeleton-conditioned T2I and text-to-video (T2V) priors and introduces a hybrid SDS module that provides multi-view and frame-consistent supervision signals, so that geometry, texture, and motion are optimized progressively in an end-to-end manner.

Quantitative and qualitative experiments show that STAR generates high-quality 4D avatars with vivid animations that align well with the text description, and ablation studies show the contribution of each component. The source code and demos are publicly available. In summary, the paper targets view consistency, animation quality, and skeleton-geometry mismatch in text-based 4D avatar generation, and addresses them by integrating motion retargeting with hybrid SDS.
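To make the idea of a hybrid SDS signal more concrete, the following is a minimal sketch, not the authors' implementation. It uses the standard SDS form, in which the gradient with respect to a rendered image is a timestep weight times the difference between the predicted noise and the injected noise, and it assumes the two priors are mixed by a simple weighted sum. The function names, the flat-list image representation, and the mixing weights `lambda_image` and `lambda_video` are all hypothetical and do not come from the paper.

```python
def sds_residual(eps_pred, eps, weight):
    """Standard SDS gradient w.r.t. the rendered pixels: w(t) * (eps_pred - eps).

    eps_pred: noise predicted by a (skeleton-conditioned) diffusion prior
    eps:      noise that was actually added to the rendering
    weight:   scalar timestep weight w(t)
    Images are flat lists of floats to keep the sketch dependency-free.
    """
    return [weight * (p - e) for p, e in zip(eps_pred, eps)]


def hybrid_sds_residual(eps_pred_image, eps_pred_video, eps, weight,
                        lambda_image=0.5, lambda_video=0.5):
    """Assumed hybrid: weighted sum of the T2I and T2V SDS residuals.

    The weighted-sum form and the lambda_* weights are illustrative
    assumptions; the source only states that both priors are combined.
    """
    g_image = sds_residual(eps_pred_image, eps, weight)
    g_video = sds_residual(eps_pred_video, eps, weight)
    return [lambda_image * a + lambda_video * b
            for a, b in zip(g_image, g_video)]
```

In a real pipeline the two noise predictions would come from skeleton-conditioned T2I and T2V diffusion models evaluated on noised renderings, and the resulting residual would be backpropagated through the differentiable renderer to the avatar's geometry, texture, and motion parameters.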