Abstract: Recent advances in mapping techniques have enabled the creation of highly accurate dense 3D maps, such as point clouds, meshes, or NeRF-based representations, during robotic missions. These developments present new opportunities for reusing such maps for localization. However, a unified approach that operates seamlessly across different map representations is still lacking. This paper presents and evaluates a global visual localization system capable of localizing a single camera image across various 3D map representations built using both visual and lidar sensing. Our system generates a database by synthesizing novel views of the scene, creating RGB and depth image pairs. Leveraging the precise 3D geometric map, our method automatically defines rendering poses, reducing the number of database images while preserving retrieval performance. To bridge the domain gap between real query camera images and synthetic database images, our approach uses learning-based descriptors and feature detectors. We evaluate the system's performance through extensive real-world experiments conducted in both indoor and outdoor settings, assessing the effectiveness of each map representation and demonstrating its advantages over traditional structure-from-motion (SfM) localization approaches. The results show that all three map representations achieve consistent localization success rates of 55% and higher across various environments. NeRF-synthesized images perform best, localizing query images at an average success rate of 72%. Furthermore, we demonstrate an advantage over SfM-based approaches: our synthesized database enables localization in the reverse travel direction, which is unseen during the mapping process. Operating in real time on a mobile laptop equipped with a GPU, our system achieves a processing rate of 1 Hz.
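
The retrieval-and-lifting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the descriptor type, similarity measure, or camera model, so everything below is an assumption for illustration: global descriptors are plain vectors compared by cosine similarity, the camera is a pinhole with intrinsics `K`, and the function names `top_k_database_images` and `backproject_keypoints` are hypothetical, not taken from the paper.

```python
import numpy as np


def top_k_database_images(query_desc, db_descs, k=3):
    """Rank synthetic database images by cosine similarity to the query descriptor.

    Assumption: one global descriptor vector per image (hypothetical setup).
    Returns the indices of the k best matches and their similarity scores.
    """
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    scores = db @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]


def backproject_keypoints(keypoints_uv, depth, K, T_world_cam):
    """Lift pixel keypoints of a rendered image to 3D world points.

    Assumptions: pinhole intrinsics K (3x3), a rendered depth image storing
    z-depth along the optical axis, and a 4x4 camera-to-world pose.
    """
    u = keypoints_uv[:, 0]
    v = keypoints_uv[:, 1]
    # Look up the rendered depth at each keypoint (nearest pixel).
    z = depth[v.astype(int), u.astype(int)]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    # Transform homogeneous camera-frame points into the world frame.
    return (T_world_cam @ pts_cam.T).T[:, :3]
```

In a pipeline of this kind, the 3D points obtained for the retrieved database image would typically be paired with matched 2D keypoints in the query image and passed to a PnP solver to estimate the query camera pose; that step is omitted here, and the abstract does not state which solver is used.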

ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

SemSCo: Semantic Frequency Domain Scan Context for LiDAR-Based Place Recognition

Multi3DRefer: Grounding Text Description to Multiple 3D Objects

SSC: Semantic Scan Context for Large-Scale Place Recognition

3D LiDAR-Based Global Localization Using Siamese Neural Network

Non-local Scan Consolidation for 3D Urban Scenes

LocNet: Global Localization in 3D Point Clouds for Mobile Robots

FreSCo: Frequency-Domain Scan Context for LiDAR-based Place Recognition with Translation and Rotation Invariance

ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

Scan Context: Egocentric Spatial Descriptor for Place Recognition Within 3D Point Cloud Map

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data

Object Scan Context: Object-centric Spatial Descriptor for Place Recognition within 3D Point Cloud Map

Scanet: Spatial-Channel Attention Network for 3D Object Detection

RefMask3D: Language-Guided Transformer for 3D Referring Segmentation

RevealNet: Seeing Behind Objects in RGB-D Scans

Search3D: Hierarchical Open-Vocabulary 3D Segmentation

Scan2CAD: Learning CAD Model Alignment in RGB-D Scans

ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding

Visual Localization in 3D Maps: Comparing Point Cloud, Mesh, and NeRF Representations