Abstract: Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image. Among these, sparse feature matching stands out as an efficient, versatile, and generally lightweight approach with numerous applications. However, feature-based methods often struggle with significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimates. To overcome this limitation, we propose a novel approach that leverages a globally sparse yet locally dense 3D representation of 2D features. By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render image patch descriptors observed during tracking. Given an initial pose estimate, we first synthesize descriptors from the voxels using volumetric rendering and then perform feature matching to estimate the camera pose. This methodology enables the generation of descriptors for unseen views, enhancing robustness to view changes. We extensively evaluate our method on the 7-Scenes and Cambridge Landmarks datasets. Our results show that our method significantly outperforms existing state-of-the-art feature representation techniques in indoor environments, achieving up to a 39% improvement in median translation error. Additionally, our approach yields comparable results to other methods for outdoor scenarios while maintaining lower memory and computational costs.
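The descriptor synthesis step mentioned in the abstract can be illustrated with a minimal sketch of volumetric rendering along a single ray. It assumes the standard emission-absorption compositing used in radiance-field rendering, applied to descriptor vectors instead of colors. The function name `render_descriptor`, the uniform step size, and the list-based inputs are illustrative assumptions, not the authors' implementation.

```python
import math


def render_descriptor(densities, features, step):
    """Alpha-composite per-sample feature vectors along one ray.

    densities: non-negative density at each sample, ordered front to back
    features:  descriptor vector at each sample (all the same length)
    step:      distance between consecutive samples (assumed uniform)
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, feat in zip(densities, features):
        # Opacity contributed by this sample over one step.
        alpha = 1.0 - math.exp(-sigma * step)
        weight = transmittance * alpha
        for k in range(dim):
            rendered[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    # Return the composited descriptor and the accumulated opacity.
    return rendered, 1.0 - transmittance
```

In a voxel-based setting, the densities and features at each sample would typically be interpolated from the voxel grid that the ray passes through. That interpolation is omitted here to keep the sketch short.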

What problem does this paper attempt to address?

This paper addresses a key problem in visual relocalization (camera relocalization): traditional feature-based methods often perform poorly under significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimation. Specifically:

- **Problem Background**: Existing visual relocalization methods have limitations when dealing with significant view and appearance changes, especially in terms of feature matching. Although traditional feature-based methods are efficient and lightweight, they are prone to matching failures or inaccurate pose estimation in these cases.
- **Shortcomings of Existing Methods**:
  - Dense descriptor representation methods (such as NeRF-based methods) improve performance but require more training time and memory resources.
  - Sparse descriptor synthesis methods have difficulty rendering high-dimensional descriptors, which limits their application range.
- **Solution Proposed in the Paper**: To overcome these limitations, the authors propose FaVoR (Features via Voxel Rendering), a new feature-rendering method. This method uses a pre-trained neural network to extract robust features, then encodes and renders feature descriptors in 3D space through a sparse voxel representation. The main features of FaVoR include:
  - **Globally Sparse but Locally Dense 3D Representation**: By tracking and triangulating feature points across multiple frames, a sparse voxel map is constructed and optimized to render the observed image patch descriptors.
  - **View-Conditioned 3D Point Descriptor Extraction**: Descriptors can be extracted efficiently from any query camera pose.
  - **Low Resource Consumption**: Compared with other methods, FaVoR reduces the computational burden and improves scalability.
- **Main Contributions**:
  - A sparse voxel algorithm that does not require learning a dense volumetric scene representation is proposed.
  - It shows how to render high-dimensional descriptors, providing better view invariance.
  - Experiments on the 7-Scenes and Cambridge Landmarks datasets show that FaVoR significantly outperforms existing implicit feature-rendering methods, reducing the median translation error by up to 39% in indoor environments.

In summary, this paper aims to overcome the limitations of existing visual relocalization techniques under significant view and appearance changes by introducing the FaVoR method, providing a more efficient and robust solution.
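Once descriptors have been rendered for the landmarks, they need to be matched against descriptors extracted from the query image. The summary does not say which matching scheme is used, so the sketch below assumes mutual nearest-neighbour matching on cosine similarity, a common choice for sparse features. The names `cosine`, `mutual_matches`, and `min_similarity` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mutual_matches(rendered, query, min_similarity=0.0):
    """Return (i, j) pairs where rendered[i] and query[j] select each other.

    rendered: descriptors synthesized for the 3D landmarks
    query:    descriptors extracted from the query image
    """
    if not rendered or not query:
        return []
    sim = [[cosine(r, q) for q in query] for r in rendered]
    # Best query index for every rendered descriptor.
    best_q = [max(range(len(query)), key=lambda j: row[j]) for row in sim]
    # Best rendered index for every query descriptor.
    best_r = [
        max(range(len(rendered)), key=lambda i: sim[i][j])
        for j in range(len(query))
    ]
    return [
        (i, j)
        for i, j in enumerate(best_q)
        if best_r[j] == i and sim[i][j] >= min_similarity
    ]
```

Each surviving pair links a 3D landmark to a 2D keypoint in the query image. Such 2D-3D correspondences are the usual input to a robust pose solver, which would then refine the initial pose estimate.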

FaVoR: Features via Voxel Rendering for Camera Relocalization

Leveraging Local Planar Motion Property for Robust Visual Matching and Localization

2-Entity Random Sample Consensus for Robust Visual Localization: Framework, Methods, and Verifications

FreSCo: Frequency-Domain Scan Context for LiDAR-based Place Recognition with Translation and Rotation Invariance

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation

CFVL: A Coarse-to-Fine Vehicle Localizer with Omnidirectional Perception Across Severe Appearance Variations

Local Optimized and Scalable Frame-to-model SLAM

Local Supports Global: Deep Camera Relocalization With Sequence Enhancement

Sparse-to-Dense Hypercolumn Matching for Long-Term Visual Localization

Voxel Map for Visual SLAM

SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

PixSelect: Less but Reliable Pixels for Accurate and Efficient Localization

Voxelized 3D Feature Aggregation for Multiview Detection

VRS-NeRF: Visual Relocalization with Sparse Neural Radiance Field

Novel 3D local feature descriptor of point clouds based on spatial voxel homogenization for feature matching

Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer

Real-time Image-based 6-DOF Localization in Large-Scale Environments

Efficient 2D-3D Matching for Multi-Camera Visual Localization

Fast and robust active camera relocalization in the wild for fine-grained change detection

Scale-Consistent Fusion: From Heterogeneous Local Sampling to Global Immersive Rendering