Abstract: Multi-view 3D human pose estimation is naturally superior to single-view estimation because images from multiple views provide more comprehensive information, including camera poses, 2D/3D human poses, and 3D geometry. However, accurate annotations of this information are hard to obtain, which makes it challenging to predict accurate 3D human poses from multi-view images. To address this issue, we propose a fully self-supervised framework, the cascaded multi-view aggregating network (CMANet), which constructs a canonical parameter space to holistically integrate and exploit multi-view information. In our framework, multi-view information is grouped into two categories: 1) intra-view information and 2) inter-view information. Accordingly, CMANet consists of two components: an intra-view module (IRV) and an inter-view module (IEV). IRV extracts the initial camera pose and 3D human pose of each view; IEV fuses complementary pose information and cross-view 3D geometry into a final 3D human pose. To facilitate the aggregation of intra- and inter-view information, we define a canonical parameter space, described by the per-view camera pose and the human pose and shape parameters ($\theta$ and $\beta$) of the SMPL model, and propose a two-stage learning procedure. In the first stage, IRV learns to estimate the camera pose and view-dependent 3D human pose, supervised by the confident outputs of an off-the-shelf 2D keypoint detector. In the second stage, IRV is frozen, and IEV further refines the camera pose and optimizes the 3D human pose by implicitly encoding cross-view complementarity and 3D geometry constraints, which is achieved by jointly fitting the predicted multi-view 2D keypoints. Comprehensive experiments demonstrate the effectiveness of the proposed framework, modules, and learning strategy, and CMANet outperforms state-of-the-art methods in extensive quantitative and qualitative analysis.
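
The supervision signal described above, fitting projected 3D joints to confident 2D detections across views, can be illustrated with a minimal sketch. This is not the CMANet implementation: the weak-perspective camera model, the confidence threshold, and all names (`project_weak_perspective`, `multiview_reprojection_loss`, the `R`/`s`/`t` camera keys) are assumptions made for illustration, and the SMPL forward pass that would produce the joints from $\theta$ and $\beta$ is omitted.

```python
import numpy as np

def project_weak_perspective(joints3d, rotation, scale, trans2d):
    """Rotate (J, 3) joints into the camera frame, drop depth, then scale and shift.

    Assumption: a weak-perspective camera; the abstract does not state the camera model.
    """
    cam_joints = joints3d @ rotation.T
    return scale * cam_joints[:, :2] + trans2d

def multiview_reprojection_loss(joints3d, cameras, keypoints2d, confidence,
                                conf_thresh=0.5):
    """Mean squared 2D error over all views, counting only confident detections.

    cameras:     list of dicts with hypothetical keys "R" (3x3), "s" (float), "t" (2,)
    keypoints2d: list of (J, 2) arrays from a 2D keypoint detector, one per view
    confidence:  list of (J,) detector scores, one per view
    """
    total, count = 0.0, 0.0
    for cam, kp, conf in zip(cameras, keypoints2d, confidence):
        pred = project_weak_perspective(joints3d, cam["R"], cam["s"], cam["t"])
        # Keep only detections at or above the (assumed) confidence threshold.
        mask = (conf >= conf_thresh).astype(float)
        total += np.sum(mask * np.sum((pred - kp) ** 2, axis=1))
        count += np.sum(mask)
    # Avoid division by zero when no detection is confident.
    return total / max(count, 1.0)
```

Because every view projects the same `joints3d`, minimizing this loss over the shared pose and the per-view cameras couples the views, which is one simple way to read "jointly fitting predicted multi-view 2D keypoints".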

Learning Canonical Shape Space for Category-Level 6D Object Pose and Size Estimation

Learning Stereopsis from Geometric Synthesis for 6D Object Pose Estimation

3D Point-to-Keypoint Voting Network for 6D Pose Estimation

Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation

SAR-Net: Shape Alignment and Recovery Network for Category-level 6D Object Pose and Size Estimation

SOCS: Semantically-aware Object Coordinate Space for Category-Level 6D Object Pose Estimation under Large Shape Variations

KGNet: Knowledge-Guided Networks for Category-Level 6D Object Pose and Size Estimation

Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild

Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos

Generative Category-Level Shape and Pose Estimation with Semantic Primitives

DualPoseNet: Category-level 6D Object Pose and Size Estimation Using Dual Pose Network with Refined Learning of Pose Consistency

DONet: Learning Category-Level 6D Object Pose and Size Estimation from Depth Observation

Learning Geometric Consistency and Discrepancy for Category-Level 6D Object Pose Estimation from Point Clouds

Self-learning Canonical Space for Multi-view 3D Human Pose Estimation

Leveraging SE(3) Equivariance for Self-Supervised Category-Level Object Pose Estimation

Object Level Depth Reconstruction for Category Level 6D Object Pose Estimation from Monocular RGB Image

CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation

RGB-based Category-level Object Pose Estimation via Decoupled Metric Scale Recovery