Abstract: Existing RGB-D methods for 6D pose estimation take RGB images and the observed point cloud derived from depth maps as input and predict rotation and translation concurrently. However, rotation and translation have distinct characteristics and scale ranges, and predicting them simultaneously can cause them to interfere with each other in the network parameter space. Additionally, the observed point cloud is susceptible to systematic noise and partial data loss, which makes it difficult for the network to capture comprehensive object features. To address these issues, we propose Semi-Decoupled 6D pose estimation via multi-modal feature fusion (SD6D). SD6D comprises a Multi-Modal Fusion Module and a Semi-Decoupled Prediction Module. The former dynamically fuses data from different modalities (RGB, depth, CAD model) based on their inter-modality correlations, which helps establish 2D-3D correspondences and mitigates the effects of systematic noise and partial data loss. The latter semi-decouples the prediction of rotation and translation, predicting them separately according to their distinct characteristics. Experiments on two popular benchmark datasets demonstrate the superiority of our method.

Semi-Decoupled 6D Pose Estimation Via Multi-Modal Feature Fusion