Abstract: Object pose estimation and camera localization are critical in many applications. However, achieving algorithm universality, meaning category-level pose estimation and scene-independent camera localization, remains challenging for both techniques. Although the two tasks are closely related through spatial geometry constraints, they require distinct feature extraction. This paper proposes a unified RGB-D based framework that simultaneously performs category-level object pose estimation and scene-independent camera localization. The framework consists of a pose estimation branch called SLO-ObjNet, a localization branch called SLO-LocNet, a pose confidence calculation process, and object-level optimization. First, initial camera and object poses are obtained from SLO-LocNet and SLO-ObjNet. In these two networks, we design three-level feature fusion modules and a loss function to share features between the two tasks. A confidence calculation process then assesses the accuracy of the estimated object poses. Next, an object-level Bundle Adjustment (BA) optimization algorithm further improves the precision of both estimates. The BA algorithm relates feature points, objects, and cameras through camera-point, camera-object, and object-point metrics. Experiments are conducted on localization and pose estimation datasets including REAL275, CAMERA25, LineMOD, YCB-Video, 7 Scenes, ScanNet, and TUM RGB-D. The results show that the approach outperforms existing methods in both pose estimation and localization accuracy. Additionally, SLO-LocNet and SLO-ObjNet are trained on ScanNet data and tested on the 7 Scenes and TUM RGB-D datasets to demonstrate their universality. Finally, we highlight the positive effects of the fusion modules, loss function, confidence process, and BA on overall performance.
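The spatial geometry constraint that links the two tasks can be illustrated with rigid transforms: an object's pose in the camera frame should equal the inverse camera pose composed with the object's world pose. The sketch below is only an illustration of that relation, not the paper's BA formulation; the function names (`make_pose`, `camera_object_residual`) and the choice of error terms are assumptions made for this example.

```python
import numpy as np

def make_pose(rotation_z_rad, translation):
    """Build a 4x4 rigid transform from a rotation about z and a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T

def camera_object_residual(T_wc, T_wo, T_co_measured):
    """Hypothetical camera-object consistency residual (an assumption, not the paper's metric).

    T_wc: camera pose in the world frame.
    T_wo: object pose in the world frame.
    T_co_measured: object pose observed in the camera frame.
    Returns (translation error, rotation angle error in radians).
    """
    # Object pose in the camera frame predicted from the two world poses.
    T_co_predicted = np.linalg.inv(T_wc) @ T_wo
    # Relative transform between the measurement and the prediction.
    delta = np.linalg.inv(T_co_measured) @ T_co_predicted
    trans_err = np.linalg.norm(delta[:3, 3])
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.arccos(cos_angle)
    return trans_err, rot_err

# Illustrative poses; the values are arbitrary.
T_wc = make_pose(0.3, [1.0, 0.0, 0.5])
T_wo = make_pose(-0.2, [2.0, 1.0, 0.0])

# A measurement that agrees with both world poses gives a near-zero residual.
T_co = np.linalg.inv(T_wc) @ T_wo
consistent = camera_object_residual(T_wc, T_wo, T_co)

# Shifting the measured object position by 0.1 shows up in the translation error.
T_co_noisy = T_co.copy()
T_co_noisy[:3, 3] += np.array([0.1, 0.0, 0.0])
perturbed = camera_object_residual(T_wc, T_wo, T_co_noisy)
```

A joint optimizer could minimize residuals of this kind over camera and object poses, alongside point-based terms; how the paper weights or defines its camera-point, camera-object, and object-point metrics is not stated in the abstract.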

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation.

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

Temporal Consistent Object Pose Estimation from Monocular Videos

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Zero-Shot 3D Pose Estimation of Unseen Object by Two-Step RGB-D Fusion

Towards Two-view 6D Object Pose Estimation: A Comparative Study on Fusion Strategy

FEIF: Feature Excitation and Interactive Fusion for 6D Object Pose Estimation.

DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

MixedFusion: 6D Object Pose Estimation from Decoupled RGB-Depth Features.

RFFCE: Residual Feature Fusion and Confidence Evaluation Network for 6DoF Pose Estimation.

PA-Pose: Partial Point Cloud Fusion Based on Reliable Alignment for 6D Pose Tracking

A Transformer-based multi-modal fusion network for 6D pose estimation

RGB-Fusion: Monocular 3D reconstruction with learned depth prediction

A Lightweight Color and Geometry Feature Extraction and Fusion Module for End-to-end 6D Pose Estimation

Robust Classification and 6D Pose Estimation by Sensor Dual Fusion of Image and Point Cloud Data

Multi-level feature fusion and joint refinement for simultaneous object pose estimation and camera localization

Efficient Bi-manipulation using RGBD Multi-model Fusion based on Attention Mechanism

FusionDepth: Complement Self-Supervised Monocular Depth Estimation with Cost Volume

Co-Occ: Coupling Explicit Feature Fusion with Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction

RobustFusion: Robust Volumetric Performance Reconstruction under Human-object Interactions from Monocular RGBD Stream