Abstract: Estimating the 6DoF pose of objects in complex scenes is a core challenge in environmental perception for unmanned systems. Recently, mesh-free pose estimation methods based on “inverse” NeRF have achieved state-of-the-art (SOTA) accuracy under ideal data conditions compared to traditional methods. However, the overall performance of this strategy remains suboptimal because of three overlooked details. First, NeRF backpropagates through only a sparse sample of pixels, which can lead to local minima on high-resolution images. Second, random pose initialization makes network convergence harder and increases estimation bias. Third, pose estimation neglects geometric consistency constraints, resulting in low robustness in occluded environments. To address these issues, this paper proposes a “coarse-to-fine” NeRF pose prediction framework (C2Fi-NeRF). For the training phase, an affinity-based full-pixel backpropagation strategy is proposed that abandons the sparse sampling of traditional NeRF training: the complete gradient map is divided into affinity blocks, which are rendered and backpropagated in sequence. This not only enables efficient full-pixel training on high-resolution images but also significantly improves the quality and consistency of rendered images, reducing noise and artifacts. The prediction phase is divided into two stages. The coarse stage optimizes reprojection error through feature point matching to provide an accurate initial pose, accelerating NeRF convergence and reducing the potential for bias. The fine stage integrates multi-view geometry and color consistency constraints into the inverse NeRF iteration, enhancing the robustness of pixel rendering in complex occluded scenes while refining the pose prediction.
Experiments demonstrate that compared to existing NeRF-based and mainstream deep learning methods, C2Fi-NeRF is competitive in accuracy and efficiency on relevant datasets (NeRF-Synthetic, LLFF, Replica, YCB) and is more suitable for practical robotic applications.
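
The block-wise full-pixel backpropagation described in the abstract can be illustrated with a toy sketch. It only shows that gradients accumulated block by block equal the gradient of the full-image loss, so every pixel can contribute without all of them being processed at once. The one-parameter `render` function, the fixed-size contiguous blocks (a stand-in for the paper's affinity blocks, whose grouping criterion the abstract does not give), and all names below are assumptions for illustration, not the authors' implementation.

```python
# Toy illustration: block-wise gradient accumulation over every pixel.
# Assumption: fixed-size contiguous blocks stand in for the paper's
# "affinity blocks"; the real grouping criterion is not given in the abstract.

def render(theta, x):
    """Hypothetical one-parameter 'renderer': predicted intensity for pixel feature x."""
    return theta * x


def full_gradient(theta, xs, targets):
    """Gradient of the mean squared error over all pixels, computed in one pass."""
    n = len(xs)
    return sum(2.0 * (render(theta, x) - t) * x for x, t in zip(xs, targets)) / n


def blockwise_gradient(theta, xs, targets, block_size):
    """Same gradient, accumulated one block at a time to bound peak memory."""
    n = len(xs)
    grad = 0.0
    for start in range(0, n, block_size):
        xb = xs[start:start + block_size]
        tb = targets[start:start + block_size]
        # Each block contributes its share of the full-image loss gradient.
        grad += sum(2.0 * (render(theta, x) - t) * x for x, t in zip(xb, tb)) / n
    return grad


xs = [0.1 * i for i in range(1, 13)]   # 12 toy "pixels"
targets = [3.0 * x for x in xs]        # generated with theta = 3
g_full = full_gradient(1.0, xs, targets)
g_block = blockwise_gradient(1.0, xs, targets, block_size=5)
```

In an autodiff framework the same idea would typically amount to calling the backward pass once per block and stepping the optimizer only after all blocks have been processed.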

C2FNet: Coarse-to-Fine Keypoint Localization Network for Monocular 6D Object Pose Estimation

3D Point-to-Keypoint Voting Network for 6D Pose Estimation

RFFCE: Residual Feature Fusion and Confidence Evaluation Network for 6DoF Pose Estimation

KDFNet: Learning Keypoint Distance Field for 6D Object Pose Estimation

DCNet: Dense Correspondence Neural Network for 6DoF Object Pose Estimation in Occluded Scenes

A Coarse-Fine Network for Keypoint Localization

BiCo-Net: Regress Globally, Match Locally for Robust 6D Pose Estimation

A Lightweight Two-End Feature Fusion Network for Object 6D Pose Estimation

Exploring Multiple Geometric Representations for 6DoF Object Pose Estimation

A dynamic keypoint selection network for 6DoF pose estimation

A 3D Keypoints Voting Network for 6DoF Pose Estimation in Indoor Scene

MFPN-6D: Real-time One-stage Pose Estimation of Objects on RGB Images

HFE-Net: Hierarchical Feature Extraction and Coordinate Conversion of Point Cloud for Object 6D Pose Estimation

C2Fi-NeRF: Coarse to Fine Inversion NeRF for 6D Pose Estimation

REG-Net: Improving 6DoF Object Pose Estimation with 2D Keypoint Long-Short-Range-Aware Registration

DGECN++: A Depth-Guided Edge Convolutional Network for End-to-End 6D Pose Estimation via Attention Mechanism

A Dynamic Keypoints Selection Network for 6DoF Pose Estimation

MLFNet: Monocular lifting fusion network for 6DoF texture-less object pose estimation

RFF-PoseNet: A 6D Object Pose Estimation Network Based on Robust Feature Fusion in Complex Scenes

SaMfENet: Self-Attention Based Multi-Scale Feature Fusion Coding and Edge Information Constraint Network for 6D Pose Estimation