Abstract: Object pose estimation and camera localization are critical in many applications. However, making either algorithm universal, that is, achieving category-level pose estimation and scene-independent camera localization, remains challenging. Although the two tasks are closely related through spatial geometry constraints, they require distinct feature extraction. This paper proposes a unified RGB-D based framework that simultaneously performs category-level object pose estimation and scene-independent camera localization. The framework consists of a pose estimation branch called SLO-ObjNet, a localization branch called SLO-LocNet, a pose confidence calculation process, and object-level optimization. First, we obtain initial camera and object results from SLO-LocNet and SLO-ObjNet. In these two networks, we design three-level feature fusion modules and a loss function to share features between the two tasks. A confidence calculation process then assesses the accuracy of the estimated object poses. Next, an object-level Bundle Adjustment (BA) optimization algorithm further improves the precision of both estimates. The BA algorithm relates feature points, objects, and cameras through camera-point, camera-object, and object-point metrics. To evaluate the approach, experiments are conducted on localization and pose estimation datasets including REAL275, CAMERA25, LineMOD, YCB-Video, 7 Scenes, ScanNet and TUM RGB-D. The results show that this approach outperforms existing methods in both pose estimation and localization accuracy. Additionally, SLO-LocNet and SLO-ObjNet are trained on ScanNet data and tested on the 7 Scenes and TUM RGB-D datasets to demonstrate their generalization ability. Finally, we highlight the positive effects of the fusion modules, loss function, confidence process, and BA on overall performance.

LHFF-Net: Local heterogeneous feature fusion network for 6DoF pose estimation

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

FDN: Feature Decoupling Network for Head Pose Estimation.

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation.

RFFCE: Residual Feature Fusion and Confidence Evaluation Network for 6DoF Pose Estimation.

FEIF: Feature Excitation and Interactive Fusion for 6D Object Pose Estimation.

PA-Pose: Partial Point Cloud Fusion Based on Reliable Alignment for 6D Pose Tracking

MixedFusion: 6D Object Pose Estimation from Decoupled RGB-Depth Features.

HFF6D: Hierarchical Feature Fusion Network for Robust 6D Object Pose Tracking

A Transformer-based multi-modal fusion network for 6D pose estimation

MLFNet: Monocular lifting fusion network for 6DoF texture-less object pose estimation

A Lightweight Color and Geometry Feature Extraction and Fusion Module for End-to-end 6D Pose Estimation

DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

Mitigating imbalances in heterogeneous feature fusion for multi-class 6D pose estimation

Zero-Shot 3D Pose Estimation of Unseen Object by Two-Step RGB-D Fusion

Multi-level feature fusion and joint refinement for simultaneous object pose estimation and camera localization

Fusion-Competition Framework of Local Topology and Global Texture for Head Pose Estimation

6-DoF grasp estimation method that fuses RGB-D data based on external attention

Robust Classification and 6D Pose Estimation by Sensor Dual Fusion of Image and Point Cloud Data

A modal fusion network with dual attention mechanism for 6D pose estimation