Abstract: This paper addresses the challenge of 6DoF pose estimation of texture-less objects from a single RGB image. Many recent works have shown that two-stage deep learning approaches based on the fusion of 2D geometric intermediate representations achieve remarkable results. These methods implicitly explore the mapping from the 2D appearance domain to the 3D structure domain. However, without the 3D geometric constraints that depth maps provide, it is difficult to extract enough cues from appearance features alone to learn the geometric relation of projection from 3D viewpoints to 2D planes, and the estimation process is extremely sensitive to occlusion. We propose a novel network called MLFNet that lifts the feature space from 2D to 3D based on hybrid 3D geometric intermediate representations. For the first time, we propose surface normals in the object coordinate system as an intermediate representation of pose; their abrupt changes provide strong cues for keypoints, which are usually located where the object surface changes sharply. Dense 3D surfaces can enhance the geometric consistency of multi-representation constraints and retain more information in occluded scenes. With the proposed multi-modality dual attention mechanism and the embedding of standard 3D shape knowledge, the 2D geometric representation learning process explicitly depends on the fusion of 2D appearance features and 3D geometric features. This standardized information fusion pattern among 2D intermediate representations, 3D intermediate representations, and CAD model priors significantly reduces the network learning space. The proposed method achieves competitive performance on the Linemod dataset and outperforms state-of-the-art methods on the Occlusion Linemod and T-Less datasets, which demonstrates the feasibility of the pose multi-representation fusion technique. The project site is at https://github.com/JJJano/MLFNet.
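
The observation that surface normals in the object coordinate system change abruptly near typical keypoint locations can be illustrated with a small sketch. The code below is not taken from MLFNet, which predicts normals from an RGB image with a network; it only assumes a hypothetical triangle mesh (three faces of a unit cube) given in the object frame and measures the angle between the normals of adjacent faces. The function names `face_normal` and `normal_angle_deg` are illustrative assumptions, not part of any released code.

```python
import math


def face_normal(v0, v1, v2):
    """Unit normal of a triangle with counter-clockwise winding (object frame)."""
    ux, uy, uz = (v1[i] - v0[i] for i in range(3))
    wx, wy, wz = (v2[i] - v0[i] for i in range(3))
    # Cross product of the two edge vectors gives the (unnormalized) normal.
    n = (uy * wz - uz * wy, uz * wx - ux * wz, ux * wy - uy * wx)
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)


def normal_angle_deg(n1, n2):
    """Angle in degrees between two unit normals."""
    dot = sum(a * b for a, b in zip(n1, n2))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


# Hypothetical object: corners of a unit cube in its own coordinate system.
A, B, C, D = (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)  # top face, z = 1
E, F = (1, 0, 0), (1, 1, 0)                              # lower edge of the x = 1 face

top_1 = face_normal(A, B, C)   # first triangle of the top face
top_2 = face_normal(A, C, D)   # second triangle of the top face
side = face_normal(B, E, F)    # triangle on the x = 1 side face

# Within a flat face the normal does not change; across the cube edge it jumps.
flat_angle = normal_angle_deg(top_1, top_2)
edge_angle = normal_angle_deg(top_1, side)
print(f"within face: {flat_angle:.1f} deg, across edge: {edge_angle:.1f} deg")
```

In this toy case the normal is constant within a face and jumps across the shared cube edge, which is the kind of discontinuity the abstract describes as a cue for keypoints at sharp changes of the object surface.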

Attention-Based RGBD Fusenet for Monocular 3D Body Geometry and Pose Reconstruction.

FDN: Feature Decoupling Network for Head Pose Estimation.

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation.

Zero-Shot 3D Pose Estimation of Unseen Object by Two-Step RGB-D Fusion

RGB-Fusion: Monocular 3D reconstruction with learned depth prediction

RFFCE: Residual Feature Fusion and Confidence Evaluation Network for 6DoF Pose Estimation.

Robust 3D Reconstruction with an RGB-D Camera

Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images

Monocular Real-time Full Body Capture with Inter-part Correlations

Marker-Less 3D Human Motion Capture With Monocular Image Sequence And Height-Maps

3D real-time human reconstruction with a single RGBD camera

3D Human Reconstruction from A Single Depth Image

MixedFusion: 6D Object Pose Estimation from Decoupled RGB-Depth Features.

MLFNet: Monocular lifting fusion network for 6DoF texture-less object pose estimation

CrossFuNet: RGB and Depth Cross-Fusion Network for Hand Pose Estimation

Pyramid Deep Fusion Network for Two-Hand Reconstruction from RGB-D Images

Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild.

A modal fusion network with dual attention mechanism for 6D pose estimation

CMA: Cross-modal Attention for 6D Object Pose Estimation