Abstract: Tremendous efforts have been made to learn animatable and photorealistic human avatars. Towards this end, both explicit and implicit 3D representations have been heavily studied for holistic modeling and capture of the whole human (e.g., body, clothing, face and hair), but neither representation is optimal in terms of representation efficacy, since different parts of the human avatar have different modeling desiderata. For example, meshes are generally not suitable for modeling clothing and hair. Motivated by this, we present Disentangled Avatars (DELTA), which models humans with hybrid explicit-implicit 3D representations. DELTA takes a monocular RGB video as input and produces a human avatar with separate body and clothing/hair layers. Specifically, we demonstrate two important applications of DELTA: in the first, we disentangle the human body and clothing; in the second, we disentangle the face and hair. To do so, DELTA represents the body or face with an explicit mesh-based parametric 3D model and the clothing or hair with an implicit neural radiance field. To make this possible, we design an end-to-end differentiable renderer that integrates meshes into volumetric rendering, enabling DELTA to learn directly from monocular videos without any 3D supervision. Finally, we show how these two applications can be easily combined to model full-body avatars, such that the hair, face, body and clothing are fully disentangled yet jointly rendered. Such disentanglement enables hair and clothing transfer to arbitrary body shapes. We empirically validate the effectiveness of DELTA's disentanglement by demonstrating its promising performance on disentangled reconstruction, virtual clothing try-on and hairstyle transfer. To facilitate future research, we also release an open-source pipeline for the study of hybrid human avatar modeling.
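
To illustrate what integrating a mesh into volumetric rendering can mean, the following is a minimal sketch for a single camera ray, not DELTA's actual renderer. It assumes the mesh is an opaque surface hit at a known depth, that radiance-field samples behind that depth are discarded, and that the mesh color is blended with whatever transmittance remains after the volume samples. The function name `composite_ray` and all of its parameters are hypothetical and chosen only for this example.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals, t_far, mesh_depth=None, mesh_color=None):
    """Alpha-composite radiance-field samples along one ray, stopping at an
    opaque mesh surface if the ray hits one (illustrative assumption only).

    sigmas: (N,) densities, colors: (N, 3) RGB, t_vals: (N,) increasing depths.
    t_far: far bound used when the ray does not hit the mesh.
    Returns (rgb, remaining_transmittance).
    """
    sigmas = np.asarray(sigmas, dtype=float)
    colors = np.asarray(colors, dtype=float).reshape(-1, 3)
    t_vals = np.asarray(t_vals, dtype=float)
    hit = mesh_depth is not None

    # Keep only the volume samples that lie in front of the mesh surface.
    if hit:
        keep = t_vals < mesh_depth
        sigmas, colors, t_vals = sigmas[keep], colors[keep], t_vals[keep]

    # Interval lengths; the last interval ends at the mesh or at the far bound.
    end = mesh_depth if hit else t_far
    deltas = np.diff(np.append(t_vals, end))

    # Standard volume-rendering quadrature: per-sample opacity and transmittance.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    cum = np.concatenate([[1.0], np.cumprod(1.0 - alphas)])
    trans, remaining = cum[:-1], cum[-1]

    rgb = ((trans * alphas)[:, None] * colors).sum(axis=0)

    # The opaque mesh absorbs all light that survives the volume samples.
    if hit:
        rgb = rgb + remaining * np.asarray(mesh_color, dtype=float)
        remaining = 0.0
    return rgb, remaining
```

Because every step is a smooth function of the densities and colors, the same computation written in an automatic-differentiation framework would pass gradients to both the volumetric samples and the mesh color. How DELTA itself handles mesh geometry gradients is not specified in this abstract.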

Disentangled Human Action Video Generation Via Decoupled Learning

Realistic Face Reenactment Via Self-Supervised Disentangling of Identity and Pose

iHairRecolorer: Deep Image-to-Video Hair Color Transfer

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

Pose Guided Human Video Generation

Deep Video Generation, Prediction and Completion of Human Action Sequences

Action2video: Generating Videos of Human 3D Actions

Learning Disentangled Avatars with Hybrid 3D Representations

Deformable Generator Networks: Unsupervised Disentanglement of Appearance and Geometry

Disentangled Representation Learning for Controllable Person Image Generation

DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation

Generating diverse clothed 3D human animations via a generative model

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

DisCo: Disentangled Control for Realistic Human Dance Generation

GenDeF: Learning Generative Deformation Field for Video Generation

Exploiting video sequences for unsupervised disentangling in generative adversarial networks

Human Motion Transfer from Poses in the Wild

DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Music Conditioned Generation for Human-Centric Video

L-C4: Language-Based Video Colorization for Creative and Consistent Color

Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer