Abstract: In this paper, we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and regress parameters in an iterative error-feedback manner. In addition, video-based approaches model the overall change over image-level features to temporally enhance the single-frame feature, but they fail to capture rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to capture the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains an error of 42.0 mm on the PA-MPJPE metric of the challenging 3DPW dataset, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model
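
The token design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes one rotation token per SMPL joint (24), a hypothetical embedding width of 8, and a single-head cross-attention step with no learned projections, whereas a real model would use learned weights and ResNet-50 features. All function names here are hypothetical. The sketch only shows how image-independent tokens might query image features frame by frame, and how the updated joint tokens could then be regrouped into one temporal sequence per joint.

```python
import math
import random

NUM_JOINTS = 24  # assumption: one rotation token per SMPL joint
DIM = 8          # hypothetical toy embedding width


def make_tokens(rng, dim=DIM):
    """Create image-independent tokens: joint rotation, shape, and camera."""
    names = [f"joint_{j}" for j in range(NUM_JOINTS)] + ["shape", "camera"]
    return {name: [rng.gauss(0.0, 0.02) for _ in range(dim)] for name in names}


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(token, image_feats):
    """Let one token query all image features, then add the result residually."""
    scale = 1.0 / math.sqrt(len(token))
    scores = [scale * sum(q * k for q, k in zip(token, feat)) for feat in image_feats]
    weights = softmax(scores)
    context = [
        sum(w * feat[i] for w, feat in zip(weights, image_feats))
        for i in range(len(token))
    ]
    return [t + c for t, c in zip(token, context)]


def update_tokens(tokens, image_feats, num_layers=2):
    """Progressively refine every token against one frame's image features."""
    for _ in range(num_layers):
        tokens = {name: cross_attend(vec, image_feats) for name, vec in tokens.items()}
    return tokens


def per_joint_sequences(frames):
    """Regroup per-frame tokens into one temporal sequence per joint."""
    return {
        f"joint_{j}": [frame[f"joint_{j}"] for frame in frames]
        for j in range(NUM_JOINTS)
    }


rng = random.Random(0)
tokens = make_tokens(rng)
frames = []
for _ in range(3):  # three video frames with random stand-in features
    feats = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(49)]
    frames.append(update_tokens(tokens, feats))
seqs = per_joint_sequences(frames)
print(len(tokens), len(seqs), len(seqs["joint_0"]))
```

Keeping each joint as its own token is what makes the per-joint regrouping trivial: a temporal model can then operate on `seqs["joint_0"]`, `seqs["joint_1"]`, and so on independently, rather than on one pooled image-level feature per frame.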

Towards Accurate Markerless Human Shape and Pose Estimation over Time

Marker-Less 3d Human Motion Capture With Monocular Image Sequence And Height-Maps

Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation

Shape and Pose Estimation for Closely Interacting Persons Using Multi-view Images

General Automatic Human Shape and Motion Capture Using Volumetric Contour Cues

New multi-view human motion capture framework

Markerless Human Body Motion Capture Using Multiple Cameras

Parallel‐branch Network for 3D Human Pose and Shape Estimation in Video

Optimization and Soft Constraints for Human Shape and Pose Estimation Based on a 3D Morphable Model

Model-Based Markerless Human Body Motion Capture using Multiple Cameras

Full-body Motion Capture for Multiple Closely Interacting Persons

3D Human pose estimation from video via multi-scale multi-level spatial temporal features

Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video

Markerless motion capture of multiple characters using multiview image segmentation

Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

3D Human Pose Estimation from Deep Multi-View 2D Pose

Markerless 3D human pose tracking through multiple cameras and AI: Enabling high accuracy, robustness, and real-time performance

Multi-view Human Pose and Shape Estimation Using Learnable Volumetric Aggregation

Human model adaptation for multiview markerless motion capture

Markerless 3D Human Motion Tracking for Monocular Video Sequences

SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers