Abstract: Self-occlusion is common when capturing people in the wild, where performers do not follow predefined motion scripts. This challenges existing monocular human reconstruction systems that assume full body visibility. We introduce Self-Occluded Avatar Recovery (SOAR), a method for complete human reconstruction from partial observations where parts of the body are entirely unobserved. SOAR leverages a structural normal prior and a generative diffusion prior to address this ill-posed reconstruction problem. For the structural normal prior, we model the human with a reposable surfel model with well-defined and easily readable shapes. For the generative diffusion prior, we perform an initial reconstruction and refine it using score distillation. On various benchmarks, we show that SOAR compares favorably with state-of-the-art reconstruction and generation methods, and is on par with concurrent works. Additional video results and code are available at https://soar-avatar.github.io/.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper "SOAR: Self-Occluded Avatar Recovery from a Single Video In the Wild" addresses the recovery of self-occluded human models from a single in-the-wild video. The authors propose SOAR (Self-Occluded Avatar Recovery), a method that recovers highly realistic human models with complete texture and shape from a single video, even when some body parts are never visible.

### Background and Challenges

Self-occlusion is very common when capturing dynamic human bodies in the wild, especially when performers do not follow predefined motion scripts. Existing monocular human reconstruction systems usually assume full visibility of the human body, which is unrealistic in most unscripted natural scenes. Recovering a complete human model from partial observations is therefore a highly challenging, ill-posed problem.

### Solution

SOAR addresses this problem by combining a structural normal prior with a generative diffusion prior:

1. **Structural Normal Prior**: The human is modeled with a reposable surfel model, which has well-defined and easily readable shapes.
2. **Generative Diffusion Prior**: An initial reconstruction is performed and then refined using score distillation.

### Method Overview

1. **Preprocessing**: Estimating foreground masks, front and back normal maps, SMPL-X parameters, and a video-level text description from the input video frames.
2. **Globally Consistent Surfel Model**: Representing the human body as a set of globally consistent 3D Gaussian surfels and posing them through forward skinning.
3. **Initial Reconstruction**: Combining RGB image supervision with the structural normal prior for an initial reconstruction, and estimating 3D occlusions.
4. **Generative Refinement**: Refining the initial reconstruction with score distillation sampling to produce a complete and highly realistic human model.
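
The forward-skinning step in the method overview can be illustrated with a minimal sketch. It assumes standard linear blend skinning, in which each surfel's canonical center and normal are posed by a weighted blend of per-bone rigid transforms. The summary does not state the exact skinning formulation SOAR uses, so the function names (`forward_skin`, `rot_z`), the two-bone setup, and the weights below are hypothetical and only illustrate the general idea.

```python
import math

def rot_z(theta):
    """Return a 3x3 rotation matrix about the z-axis (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def forward_skin(center, normal, weights, rotations, translations):
    """Pose one surfel with linear blend skinning (an assumed formulation).

    center, normal : canonical-space position and unit normal of the surfel
    weights        : per-bone skinning weights, expected to sum to 1
    rotations      : per-bone 3x3 rotation matrices
    translations   : per-bone translation vectors
    """
    posed_center = [0.0, 0.0, 0.0]
    posed_normal = [0.0, 0.0, 0.0]
    for w, rot, trans in zip(weights, rotations, translations):
        rc = mat_vec(rot, center)
        rn = mat_vec(rot, normal)  # normals rotate but do not translate
        for i in range(3):
            posed_center[i] += w * (rc[i] + trans[i])
            posed_normal[i] += w * rn[i]
    # Blending rotations shrinks the normal, so renormalize when possible.
    length = math.sqrt(sum(x * x for x in posed_normal))
    if length > 0.0:
        posed_normal = [x / length for x in posed_normal]
    return posed_center, posed_normal

# Hypothetical two-bone example: bone 0 stays fixed, bone 1 rotates 90 degrees about z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rotations = [identity, rot_z(math.pi / 2)]
translations = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# A surfel influenced equally by both bones lands halfway between the two poses.
posed_c, posed_n = forward_skin(
    [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5], rotations, translations
)
```

Because the surfels live in one canonical space and are only posed per frame, observations from every frame update the same set of surfels, which is what makes the representation globally consistent across the video.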
### Experiments and Evaluation

The authors conducted experiments on multiple benchmarks, including FS-XHumans, DNA-Rendering, and internet videos. The results show that SOAR outperforms existing methods on various metrics, particularly in self-occluded regions.

### Conclusion

SOAR makes significant progress in recovering self-occluded human models from a single in-the-wild video, but some limitations remain: a tendency to generate saturated colors, reliance on test-time optimization, which limits interactive use, and the lack of comprehensive in-the-wild datasets with real multi-view annotations. Future research can address these aspects to achieve more robust and practical human model recovery.

SOAR: Self-Occluded Avatar Recovery from a Single Video In the Wild
