SOAR: Self-Occluded Avatar Recovery from a Single Video In the Wild

Zhuoyang Pan, Angjoo Kanazawa, Hang Gao
2024-10-31
Abstract: Self-occlusion is common when capturing people in the wild, where performers do not follow predefined motion scripts. This challenges existing monocular human reconstruction systems that assume full body visibility. We introduce Self-Occluded Avatar Recovery (SOAR), a method for complete human reconstruction from partial observations where parts of the body are entirely unobserved. SOAR leverages a structural normal prior and a generative diffusion prior to address this ill-posed reconstruction problem. For the structural normal prior, we model the human with a reposable surfel model with well-defined and easily readable shapes. For the generative diffusion prior, we perform an initial reconstruction and refine it using score distillation. On various benchmarks, we show that SOAR performs favorably against state-of-the-art reconstruction and generation methods, and on par with concurrent works. Additional video results and code are available at https://soar-avatar.github.io/.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper "SOAR: Self-Occluded Avatar Recovery from a Single Video In the Wild" addresses the recovery of self-occluded human avatars from a single in-the-wild video. The authors propose SOAR (Self-Occluded Avatar Recovery), a method that recovers a highly realistic human model with complete texture and shape from a single video, even when some body parts are never visible.

### Background and Challenges

Self-occlusion is very common when capturing dynamic humans in the wild, especially when performers do not follow predefined motion scripts. Existing monocular human reconstruction systems usually assume full visibility of the body, which is unrealistic in most unscripted scenes. Recovering a complete human model from partial observations is therefore an ill-posed and highly challenging problem.

### Solution

SOAR addresses this problem by combining a structural normal prior with a generative diffusion prior:

1. **Structural Normal Prior**: The human is modeled with a reposable surfel model that has well-defined and easily readable shapes.
2. **Generative Diffusion Prior**: An initial reconstruction is performed and then refined with score distillation.

### Method Overview

1. **Preprocessing**: Estimate foreground masks, front and back normal maps, SMPL-X parameters, and a video-level text description from the input video frames.
2. **Globally Consistent Surfel Model**: Represent the human body as a set of globally consistent 3D Gaussian surfels, and pose them through forward skinning.
3. **Initial Reconstruction**: Combine RGB image supervision with the structural normal prior to obtain an initial reconstruction, and estimate 3D occlusions.
4. **Generative Refinement**: Refine the initial reconstruction with score distillation sampling to produce a complete and highly realistic human model.
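
The forward skinning mentioned in step 2 can be illustrated with linear blend skinning, a standard way to pose points attached to a skeleton. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function name `forward_skin`, the representation of each joint as a `(rotation, translation)` pair, and the choice to rotate normals by the blended linear part are all hypothetical. It uses plain Python to stay dependency-free; a real system would batch this on the GPU.

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix (nested lists) by a length-3 vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def forward_skin(center, normal, weights, joint_transforms):
    """Pose one canonical surfel with linear blend skinning (illustrative).

    center, normal   -- length-3 lists in the canonical (rest) pose
    weights          -- one skinning weight per joint, assumed to sum to 1
    joint_transforms -- list of (R, t) pairs: 3x3 rotation and length-3
                        translation mapping canonical space to posed space
    """
    # Blend the per-joint transforms into a single affine map for this surfel.
    blended_r = [
        [sum(w * rot[r][c] for w, (rot, _) in zip(weights, joint_transforms))
         for c in range(3)]
        for r in range(3)
    ]
    blended_t = [
        sum(w * trans[r] for w, (_, trans) in zip(weights, joint_transforms))
        for r in range(3)
    ]

    # Positions receive the full affine map.
    posed_center = [a + b for a, b in zip(mat_vec(blended_r, center), blended_t)]

    # Normals are directions: apply only the linear part, then renormalize.
    # (Assumption: the blended matrix is close to a rotation, so the
    # inverse-transpose correction is skipped.)
    n = mat_vec(blended_r, normal)
    length = math.sqrt(sum(x * x for x in n))
    posed_normal = [x / length for x in n]
    return posed_center, posed_normal


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90 degrees about z
    joints = [(identity, [0, 0, 0]), (rot_z_90, [0, 0, 1])]

    # A surfel bound entirely to the second joint follows its rotation and shift.
    print(forward_skin([1, 0, 0], [1, 0, 0], [0.0, 1.0], joints))
```

Because every surfel is stored once in canonical space and only posed per frame, observations from all frames update the same set of surfels, which is what makes the model globally consistent across the video.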

### Experiments and Evaluation

The authors evaluate SOAR on multiple benchmarks, including FS-XHumans, DNA-Rendering, and internet videos. The results show that SOAR outperforms existing methods on various metrics, particularly in self-occluded regions.

### Conclusion

SOAR makes significant progress in recovering self-occluded human models from a single in-the-wild video, but some limitations remain: issues with saturated colors in the generated results, test-time optimization that limits interactive use, and the lack of comprehensive in-the-wild datasets with real multi-view annotations. Future research can address these aspects to achieve more robust and practical human model recovery.