Abstract: High-fidelity digital human representations are increasingly in demand, particularly for interactive telepresence, AR/VR, 3D graphics, and the rapidly evolving metaverse. Conventional methods for reconstructing 3D human motion work well in small capture spaces, but they frequently require expensive hardware and incur high processing costs. This study presents HumanAvatar, an approach that efficiently reconstructs precise human avatars from monocular video. At the core of our methodology is the pre-trained HuMoR model for human motion estimation, which we combine with Instant-NGP, an accelerated neural radiance field technique, and Fast-SNARF, an articulated model, to improve reconstruction fidelity and speed. Combining these components yields a system that renders quickly while also providing highly accurate estimates of human pose parameters. We further add a posture-sensitive space reduction technique that balances rendering quality against computational efficiency. In detailed experiments on both synthetic and real-world monocular videos, HumanAvatar consistently equals or surpasses contemporary state-of-the-art (SoTA) reconstruction techniques in quality. Furthermore, it achieves these reconstructions in minutes, a fraction of the time typically required by existing methods. Our models train 110× faster than SoTA NeRF-based models. Given an identical runtime budget, our technique performs noticeably better than SoTA dynamic human NeRF methods. HumanAvatar can produce effective visuals after only 30 seconds of training.

An Efficient Graph Transformer Network for Video-Based Human Mesh Reconstruction

Dual-Branch Graph Transformer Network for 3D Human Mesh Reconstruction from Video

Mixed Transformer for Temporal 3D Human Pose and Shape Estimation from Monocular Video

GATOR: Graph-Aware Transformer with Motion-Disentangled Regression for Human Mesh Recovery from a 2D Pose

Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

Image-Guided Human Reconstruction via Multi-Scale Graph Transformation Networks

Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers

3D Human Pose Estimation with Spatial and Temporal Transformers

ProGraph: Temporally-alignable Probability Guided Graph Topological Modeling for 3D Human Reconstruction

Multi-hop graph transformer network for 3D human pose estimation

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos

3D Human Mesh Reconstruction by Learning to Sample Joint Adaptive Tokens for Transformers

3D hand pose and mesh estimation via a generic Topology-aware Transformer model

Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction

Multi-view 3D Reconstruction from Video with Transformer

Graph-aware transformer for skeleton-based action recognition

Efficient Neural Implicit Representation for 3D Human Reconstruction

Disambiguating Monocular Reconstruction of 3D Clothed Human with Spatial-Temporal Transformer

Humans in 4D: Reconstructing and Tracking Humans with Transformers

MUG: Multi-human Graph Network for 3D Mesh Reconstruction from 2D Pose