Instruct-Video2Avatar: Video-to-Avatar Generation with Instructions

Shaoxu Li
2023-06-05
Abstract: We propose a method for synthesizing edited photo-realistic digital avatars with text instructions. Given a short monocular RGB video and text instructions, our method uses an image-conditioned diffusion model to edit one head image and uses a video stylization method to edit the remaining head images. Through iterative training and update (three times or more), our method synthesizes edited photo-realistic animatable 3D neural head avatars with a deformable neural radiance field head synthesis method. In quantitative and qualitative studies on various subjects, our method outperforms state-of-the-art methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to generate edited, photo-realistic, animatable 3D neural head avatars from a short monocular RGB video and text instructions. It proposes a method, Instruct-Video2Avatar, that edits the subject's head in the input video according to a text instruction (e.g., "make him look 100 years old" or "make him happier") and produces a personalized 3D avatar that reflects the requested edit. Existing techniques can generate realistic 3D avatars but offer limited personalization, particularly for stylized editing driven by specific user requirements. The paper's main contribution is therefore a way for users to create personalized, realistic 3D avatars from simple text instructions, with potential applications in virtual reality, gaming, and film production.
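
The pipeline described in the abstract (edit one head image, propagate the edit to the other frames, train the avatar, repeat) can be sketched as a short control loop. This is a minimal sketch under stated assumptions, not the author's implementation: the four callables (edit_image, propagate_style, train_avatar, render_frames) are hypothetical placeholders for the image-conditioned diffusion model, the video stylization method, and the deformable neural radiance field trainer and renderer. The abstract does not say what is updated between rounds; re-rendering frames from the current avatar is an assumption made here for illustration.

```python
from typing import Any, Callable, List, Sequence


def instruct_video_to_avatar(
    frames: Sequence[Any],
    instruction: str,
    edit_image: Callable[[Any, str], Any],
    propagate_style: Callable[[Any, int, Sequence[Any]], List[Any]],
    train_avatar: Callable[[Sequence[Any]], Any],
    render_frames: Callable[[Any, int], List[Any]],
    rounds: int = 3,
    key_index: int = 0,
) -> Any:
    """Return an avatar trained on instruction-edited frames.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    if not frames:
        raise ValueError("frames must not be empty")

    current = list(frames)
    avatar = None
    for round_index in range(rounds):
        # 1. Edit a single key frame according to the text instruction.
        edited_key = edit_image(current[key_index], instruction)
        # 2. Propagate that edit to every frame (video stylization step).
        edited_frames = propagate_style(edited_key, key_index, current)
        # 3. Train the animatable head avatar on the edited frames.
        avatar = train_avatar(edited_frames)
        # 4. Assumed update step: re-render the frames from the avatar so the
        #    next round starts from more view-consistent images.
        if round_index < rounds - 1:
            current = render_frames(avatar, len(current))
    return avatar
```

The default of rounds=3 mirrors the abstract's "three times or more"; passing the models in as callables keeps the sketch independent of any particular diffusion, stylization, or radiance-field library.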