Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM

Laksh Nanwani, Kumaraditya Gupta, Aditya Mathur, Swayam Agrawal, A.H. Abdul Hafez, K. Madhava Krishna

2024-04-27

Abstract: Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work SI Maps [1] showed that having instance-level information and the semantic understanding of an environment helps significantly improve performance for language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline's robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which bring in the semantic understanding that natural language commands can query. Quantitatively, the work improves upon the success rate of language-guided tasks. At the same time, we qualitatively observe the ability to identify instances more clearly and leverage the foundational models and language and image-aligned embeddings to identify objects that, otherwise, a closed-set approach wouldn't be able to identify.

Computer Vision and Pattern Recognition, Robotics

What problem does this paper attempt to address?

### The Problems Addressed by This Paper

This paper primarily addresses the following issues:

1. **Instance-Level Recognition in Vision-and-Language Navigation (VLN)**:
   - Current visual language navigation methods have limitations when dealing with specific instance queries, especially in tasks that require recognizing specific object instances and performing spatial reasoning. This paper proposes an Open-Set 3D Semantic Instance Map (O3D-SIM) that can better recognize and distinguish specific object instances.
2. **Extension from 2D to 3D**:
   - Previous 2D methods are limited in performance when dealing with large objects occluding smaller ones and assume that all categories are predefined. O3D-SIM extends previous 2D methods to 3D, enhancing the understanding of complex environments and enabling the recognition of unseen object instances in an open set.
3. **Open-Set Handling**:
   - Traditional closed-set methods assume that only predefined objects will be encountered in the environment. O3D-SIM, by leveraging the latest image segmentation techniques and open-set image-language alignment embeddings, can recognize and classify unseen objects in an open set.
4. **Handling Complex Queries**:
   - O3D-SIM not only improves the success rate of correctly recognizing object instances but also handles complex language navigation queries, including queries for specific object instances that were not seen during training.

In summary, this paper aims to significantly enhance object recognition and instance differentiation capabilities in vision-and-language navigation tasks, especially in complex and real-world data, by introducing the Open-Set 3D Semantic Instance Map (O3D-SIM).
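To make the idea of a map with instance-level embeddings that language can query more concrete, here is a minimal sketch. It is not the authors' implementation: the `Instance` structure, the `query_map` helper, and the three-dimensional toy vectors are all hypothetical stand-ins. In a real pipeline the embeddings would come from a language-and-image-aligned model and would have hundreds of dimensions. The sketch only shows the retrieval step, ranking stored instances by cosine similarity to a query embedding.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    """One object instance in the map (hypothetical structure, for illustration)."""
    instance_id: int
    centroid: tuple[float, float, float]  # mean position of the instance's 3D points
    embedding: list[float]                # stand-in for a language-aligned image feature


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_map(instances, text_embedding, top_k=1):
    """Return the top_k instances most similar to the query embedding."""
    ranked = sorted(
        instances,
        key=lambda inst: cosine(inst.embedding, text_embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: made-up vectors, not real model outputs.
toy_map = [
    Instance(0, (1.0, 0.5, 0.0), [0.9, 0.1, 0.0]),  # e.g. a chair
    Instance(1, (4.0, 2.0, 0.0), [0.8, 0.3, 0.1]),  # e.g. a second chair
    Instance(2, (2.5, 3.0, 0.8), [0.0, 0.2, 0.9]),  # e.g. a lamp
]
query = [0.1, 0.1, 1.0]  # pretend text embedding for "the lamp"

best = query_map(toy_map, query)[0]
print(best.instance_id, best.centroid)
```

Because each instance keeps its own embedding and centroid, the two chairs remain separate entries, so a query could in principle be resolved to one specific instance and its 3D location rather than to a category as a whole.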

Instance-Level Semantic Maps for Vision Language Navigation

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

ConceptFusion: Open-set Multimodal 3D Mapping

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Object-Oriented 3D Semantic Mapping Based on Instance Segmentation

Object-aware Semantic Mapping of Indoor Scenes Using Octomap

Volumetric Instance-Aware Semantic Mapping and 3D Object Discovery

Monocular Semantic Mapping Based on 3D Cuboids Tracking.

3D Semantic MapNet: Building Maps for Multi-Object Re-Identification in 3D

OpenSU3D: Open World 3D Scene Understanding using Foundation Models

OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data

Volumetric Semantically Consistent 3D Panoptic Mapping

Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

Navigating to Objects Specified by Images

Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

OpenScene: 3D Scene Understanding with Open Vocabularies