Abstract:

Background: Model-based 3D pose estimation is widely used in 3D human motion analysis, where vision-based and inertial-based methods form two distinct lines of work. Multi-view images from a vision-based markerless capture system provide essential data for motion analysis, but erroneous estimates still occur because of ambiguities, occlusion, or image noise. In addition, a multi-view setup is hard to deploy in the wild. Inertial measurement units (IMUs) can provide accurate orientation measurements unaffected by occlusion, but they are susceptible to magnetic field interference and drift. Hybrid motion capture has therefore drawn researchers' attention in recent years. Existing hybrid methods jointly optimize the 3D pose parameters by minimizing the discrepancy between the image and IMU data. However, these methods still suffer from issues such as complex peripheral devices, sensitivity to initialization, and slow convergence.

Methods: This article presents an approach that improves 3D human pose estimation by fusing a single image with sparse IMUs. Building on a dual-stream feature extraction network, we design a model-attention network with a residual module that closely couples the dual-modal features from a static image and the sparse IMUs. The final 3D pose and shape parameters are obtained directly by regression.

Results: Extensive experiments are conducted on two benchmark datasets for 3D human pose estimation. Compared with state-of-the-art methods, the per-vertex error (PVE) of the human mesh is reduced by 9.4 mm on the Total Capture dataset, and the mean per-joint position error (MPJPE) is reduced by 7.8 mm on the Human3.6M dataset. The quantitative comparison demonstrates that the proposed method effectively fuses sparse IMU data with images and improves pose accuracy.
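For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how MPJPE and PVE are commonly computed: both are the mean Euclidean distance between matched predicted and ground-truth 3D points, taken over joints for MPJPE and over mesh vertices for PVE. The function name `mean_per_point_error` and the toy data are illustrative assumptions, and the sketch does not reproduce any alignment step (for example root-joint centering) that the article's evaluation protocol may apply.

```python
import math

def mean_per_point_error(pred, gt):
    """Mean Euclidean distance between matched 3D points.

    pred, gt: equal-length sequences of (x, y, z) tuples in the same unit
    (e.g. millimetres). With joints this is the usual MPJPE definition;
    with mesh vertices it is the usual PVE definition.
    Hypothetical helper, not taken from the article.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Toy example (assumed data): 17 joints, each prediction offset by (3, 4, 0) mm,
# so every joint is 5 mm from its ground truth.
gt_joints = [(0.0, 0.0, 0.0)] * 17
pred_joints = [(3.0, 4.0, 0.0)] * 17
error_mm = mean_per_point_error(pred_joints, gt_joints)
print(error_mm)
```

Under this definition, a reported reduction of 7.8 mm in MPJPE means the average joint lies 7.8 mm closer to its ground-truth position than with the compared method.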

3D Human Pose Estimation Via Spatio-Temporal Matching from Monocular RGB Images

3D Point-to-Keypoint Voting Network for 6D Pose Estimation

X-HRNet: Towards Lightweight Human Pose Estimation with Spatially Unidimensional Self-Attention

Marker-Less 3D Human Motion Capture With Monocular Image Sequence And Height-Maps

3D Human Pose Estimation = 2D Pose Estimation + Matching

A Survey on Monocular 3D Human Pose Estimation

Unified End-to-End YOLOv5-HR-TCM Framework for Automatic 2D/3D Human Pose Estimation for Real-Time Applications

3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training

3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention

Monocular 3D Human Pose Estimation by Predicting Depth on Joints

P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation

Robust Estimation of 3D Human Poses from a Single Image

Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video

Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision

3D Human pose estimation from video via multi-scale multi-level spatial temporal features

3D-UGCN: A Unified Graph Convolutional Network for Robust 3D Human Pose Estimation from Monocular RGB Images

Motion Capture Research: 3D Human Pose Recovery Based on RGB Video Sequences

Locally Connected Network for Monocular 3D Human Pose Estimation

Monocular 3D human pose estimation via sequential second order cone programming

Hand Pose Estimation via Latent 2.5D Heatmap Regression

Reconstructing 3D human pose and shape from a single image and sparse IMUs