Abstract: Semantic navigation is necessary to deploy mobile robots in uncontrolled environments such as homes or hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding in the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, whereas modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. However, learned visual navigation policies have predominantly been evaluated in simulation, with little known about what works on a robot. We present a large-scale empirical study of semantic visual navigation methods, comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We found that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from a 77% success rate in simulation to 23% in the real world because of a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable sim-to-real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks (a large sim-to-real gap in images and a disconnect between simulation and real-world error modes) and propose concrete steps forward.

Learning a Semantic Prior for Guided Navigation

ChatNav: Leveraging LLM to Zero-shot Semantic Reasoning in Object Navigation

Visual Semantic Navigation using Scene Priors

Learning and Planning with a Semantic Model

Semantic Visual Navigation by Watching YouTube Videos

Learning Navigational Visual Representations with Semantic Map Supervision

Knowledge-driven Scene Priors for Semantic Audio-Visual Embodied Navigation

Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

Multi-Agent Embodied Visual Semantic Navigation with Scene Prior Knowledge

Visual Representations for Semantic Target Driven Navigation

Navigating to objects in the real world

Prioritized Semantic Learning for Zero-shot Instance Navigation

Vision-and-Language Navigation via Latent Semantic Alignment Learning

Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation

Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation

Object Goal Visual Navigation Using Semantic Spatial Relationships

Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning

Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation

Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation