Abstract: Object pose estimation and camera localization are critical in many applications. However, achieving algorithm universality, meaning category-level pose estimation and scene-independent camera localization, is challenging for both techniques. Although the two tasks are closely related through spatial geometry constraints, they require distinct feature extraction. This paper presents a unified RGB-D based framework that simultaneously performs category-level object pose estimation and scene-independent camera localization. The framework consists of a pose estimation branch called SLO-ObjNet, a localization branch called SLO-LocNet, a pose confidence calculation process, and an object-level optimization. First, we obtain the initial camera and object results from SLO-LocNet and SLO-ObjNet. In these two networks, we design three-level feature fusion modules and a loss function to share features between the two tasks. Next, a confidence calculation process assesses the accuracy of the estimated object poses. Finally, an object-level Bundle Adjustment (BA) optimization algorithm further improves the precision of both the object poses and the camera localization. The BA algorithm relates feature points, objects, and cameras through camera-point, camera-object, and object-point metrics. To evaluate the approach, experiments are conducted on localization and pose estimation datasets including REAL275, CAMERA25, LineMOD, YCB-Video, 7 Scenes, ScanNet, and TUM RGB-D. The results show that this approach outperforms existing methods in both pose estimation and localization accuracy. Additionally, SLO-LocNet and SLO-ObjNet are trained on ScanNet data and tested on the 7 Scenes and TUM RGB-D datasets to demonstrate their generality. We also show the positive effects of the fusion modules, loss function, confidence process, and BA on overall performance.
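
The abstract names three BA metrics (camera-point, camera-object, and object-point) but does not give their form. The following is a minimal sketch of how such terms could be combined into one cost, not the paper's formulation. It assumes that poses are 4x4 rigid transforms, that the camera-point term is a pinhole reprojection error, that the camera-object term compares the network's object pose with the one implied by the current camera and object estimates, that the object-point term ties a world point to fixed object-frame coordinates, and that the pose confidence acts as a scalar weight on the camera-object term. All function and variable names are hypothetical.

```python
import numpy as np

def make_pose(rotation_z_rad, translation):
    """Build a 4x4 rigid transform from a rotation about z and a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T

def transform(T, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    return T[:3, :3] @ p + T[:3, 3]

def camera_point_residual(K, T_wc, p_w, uv_obs):
    """Assumed camera-point term: pinhole reprojection error in pixels.
    T_wc maps camera coordinates to world coordinates."""
    p_c = transform(np.linalg.inv(T_wc), p_w)
    uv = (K @ p_c)[:2] / p_c[2]
    return uv - uv_obs

def camera_object_residual(T_wc, T_wo, T_co_meas):
    """Assumed camera-object term: translation error (3 values) and rotation
    angle (1 value) between the predicted and the measured object pose."""
    T_co_pred = np.linalg.inv(T_wc) @ T_wo
    delta = np.linalg.inv(T_co_meas) @ T_co_pred
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return np.append(delta[:3, 3], np.arccos(cos_angle))

def object_point_residual(T_wo, p_w, p_o):
    """Assumed object-point term: the world point, moved into the object
    frame, should match its fixed object-frame coordinates p_o."""
    return transform(np.linalg.inv(T_wo), p_w) - p_o

def total_cost(K, T_wc, T_wo, p_w, uv_obs, T_co_meas, p_o, conf):
    """Sum of squared residuals; conf (assumed to lie in [0, 1]) down-weights
    object pose measurements that the confidence step considers unreliable."""
    r_cp = camera_point_residual(K, T_wc, p_w, uv_obs)
    r_co = camera_object_residual(T_wc, T_wo, T_co_meas)
    r_op = object_point_residual(T_wo, p_w, p_o)
    return r_cp @ r_cp + conf * (r_co @ r_co) + r_op @ r_op

# Synthetic, self-consistent example: one camera, one object, one point.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T_wc = make_pose(0.1, [0.2, -0.1, 0.0])
T_wo = make_pose(-0.3, [0.5, 0.3, 2.0])
p_o = np.array([0.05, -0.02, 0.1])
p_w = transform(T_wo, p_o)
p_c = transform(np.linalg.inv(T_wc), p_w)
uv_obs = (K @ p_c)[:2] / p_c[2]
T_co_meas = np.linalg.inv(T_wc) @ T_wo

# Cost at the true object pose versus a perturbed object pose.
cost_true = total_cost(K, T_wc, T_wo, p_w, uv_obs, T_co_meas, p_o, conf=0.8)
T_wo_bad = make_pose(-0.25, [0.55, 0.3, 2.0])
cost_perturbed = total_cost(K, T_wc, T_wo_bad, p_w, uv_obs, T_co_meas, p_o, conf=0.8)
```

In a full BA, a cost of this kind would be summed over all frames, objects, and points and minimized with a nonlinear least-squares solver. The sketch only evaluates it to show that a perturbed object pose raises the cost relative to the consistent one.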

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

Temporal Consistent Object Pose Estimation from Monocular Videos

Zero-Shot 3D Pose Estimation of Unseen Object by Two-Step RGB-D Fusion

Towards Two-view 6D Object Pose Estimation: A Comparative Study on Fusion Strategy

PA-Pose: Partial Point Cloud Fusion Based on Reliable Alignment for 6D Pose Tracking

RGB-Fusion: Monocular 3D reconstruction with learned depth prediction

MixedFusion: 6D Object Pose Estimation from Decoupled RGB-Depth Features

FEIF: Feature Excitation and Interactive Fusion for 6D Object Pose Estimation

Video object matching across multiple non-overlapping camera views based on multi-feature fusion and incremental learning

DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

Cascaded Multi-3D-view Fusion for 3D-Oriented Object Detection

Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

A Transformer-based multi-modal fusion network for 6D pose estimation

A Lightweight Color and Geometry Feature Extraction and Fusion Module for End-to-end 6D Pose Estimation

A Pose Estimation Algorithm for Multimodal Data Fusion

RobustFusion: Robust Volumetric Performance Reconstruction under Human-object Interactions from Monocular RGBD Stream

RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo

Efficient Bi-manipulation using RGBD Multi-model Fusion based on Attention Mechanism

Multi-level feature fusion and joint refinement for simultaneous object pose estimation and camera localization