Abstract: Recent neural rendering methods have made great progress in generating photorealistic human avatars. However, these methods are generally conditioned only on low-dimensional driving signals (e.g., body poses), which are insufficient to encode the complete appearance of a clothed human, so they fail to generate faithful details. To address this problem, we exploit driving view images (e.g., in telepresence systems) as additional inputs. We propose a novel neural rendering pipeline, Hybrid Volumetric-Textural Rendering (HVTR++), which efficiently synthesizes high-quality 3D human avatars from arbitrary driving poses and views while staying faithful to appearance details. First, we learn to encode the driving signals of pose and view image on a dense UV manifold of the human body surface and extract UV-aligned features, preserving the structure of a skeleton-based parametric model. To handle complicated motions (e.g., self-occlusions), we then leverage the UV-aligned features to construct a 3D volumetric representation based on a dynamic neural radiance field. While this allows us to represent 3D geometry with changing topology, volumetric rendering is computationally heavy. Hence we employ only a rough volumetric representation using a pose- and image-conditioned downsampled neural radiance field (PID-NeRF), which we can render efficiently at low resolutions. In addition, we learn 2D textural features that are fused with the rendered volumetric features in image space. The key advantage of our approach is that we can then convert the fused features into a high-resolution, high-quality avatar with a fast GAN-based textural renderer. We demonstrate that hybrid rendering enables HVTR++ to handle complicated motions, render high-quality avatars under user-controlled poses/shapes, and, most importantly, be efficient at inference time. Our experiments also show state-of-the-art quantitative results.
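
The hybrid idea described in the abstract (render volumetric features at low resolution, fuse them with 2D textural features in image space, then upsample) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `composite_features` and `fuse_and_upsample` are hypothetical, channel concatenation is only an assumed fusion operator, and nearest-neighbor upsampling stands in for the learned GAN-based textural renderer. The compositing step uses the standard NeRF quadrature weights.

```python
import numpy as np

def composite_features(sigma, feats, deltas):
    """Alpha-composite per-sample features along each ray (standard NeRF quadrature).

    sigma:  (H, W, S) non-negative densities at S samples per ray
    feats:  (H, W, S, C) feature vector at each sample
    deltas: (S,) spacing between consecutive samples
    Returns the (H, W, C) feature map and the (H, W) accumulated opacity.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)            # per-sample opacity
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(1.0 - alpha, axis=-1)
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=-1)
    weights = alpha * trans                          # (H, W, S)
    return (weights[..., None] * feats).sum(axis=-2), weights.sum(axis=-1)

def fuse_and_upsample(vol_feat, tex_feat, scale):
    """Fuse low-res volumetric and textural features, then upsample.

    Assumption: fusion is channel concatenation, and nearest-neighbor
    upsampling is a placeholder for a learned textural renderer.
    """
    fused = np.concatenate([vol_feat, tex_feat], axis=-1)
    return fused.repeat(scale, axis=0).repeat(scale, axis=1)

# Toy inputs: a 16x16 ray grid, 8 samples per ray, 4 feature channels.
rng = np.random.default_rng(0)
H, W, S, C = 16, 16, 8, 4
sigma = rng.uniform(0.0, 5.0, (H, W, S))
feats = rng.normal(size=(H, W, S, C))
deltas = np.full(S, 0.1)

vol_feat, opacity = composite_features(sigma, feats, deltas)
tex_feat = rng.normal(size=(H, W, C))                # stand-in 2D textural features
fused = fuse_and_upsample(vol_feat, tex_feat, scale=4)
print(fused.shape)  # (64, 64, 8)
```

In this toy setup the costly per-ray compositing runs on only 16 x 16 rays, while the output has 64 x 64 pixels. That gap is the source of the efficiency the abstract attributes to rendering the radiance field at low resolution.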

TriHuman: A Real-time and Controllable Tri-plane Representation for Detailed Human Geometry and Appearance Synthesis

HR Human: Modeling Human Avatars with Triangular Mesh and High-Resolution Textures from Videos

HDHumans: A Hybrid Approach for High-fidelity Digital Humans

Towards 4D Human Video Stylization

Tri$^{2}$-plane: Thinking Head Avatar via Feature Pyramid

Efficient Neural Implicit Representation for 3D Human Reconstruction

TimeWalker: Personalized Neural Space for Lifelong Head Avatars

LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies

MonoHuman: Animatable Human Neural Field from Monocular Video

MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos

Deformable 3D Gaussian Splatting for Animatable Human Avatars

DoubleFusion: Real-Time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor

Neural Capture of Animatable 3D Human from Monocular Video

HVTR++: Image and Pose Driven Human Avatars Using Hybrid Volumetric-Textural Rendering

Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Coherent3D: Coherent 3D Portrait Video Reconstruction via Triplane Fusion

PGAHum: Prior-Guided Geometry and Appearance Learning for High-Fidelity Animatable Human Reconstruction

HVTR: Hybrid Volumetric-Textural Rendering for Human Avatars

3D Real Human Reconstruction Via Multiple Low-Cost Depth Cameras

Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis