Abstract: Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real-world deployment can be viewed at http://naoki.io/vlfm.
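The frontier-selection step described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes a 2D occupancy grid with a hypothetical cell encoding (`UNKNOWN`, `FREE`, `OCCUPIED`) and takes the value map as a given array, whereas VLFM derives its value map from RGB observations and a pre-trained vision-language model. The function names `find_frontiers` and `best_frontier` are illustrative only.

```python
import numpy as np

# Hypothetical cell labels for the toy occupancy grid (not from the paper).
UNKNOWN, FREE, OCCUPIED = -1, 0, 1


def find_frontiers(occupancy):
    """Return (row, col) indices of free cells that touch an unknown cell."""
    unknown = occupancy == UNKNOWN
    # Pad with False so that cells on the border have no out-of-grid neighbours.
    padded = np.pad(unknown, 1, constant_values=False)
    # True where the up, down, left, or right neighbour is unknown.
    near_unknown = (
        padded[:-2, 1:-1] | padded[2:, 1:-1] | padded[1:-1, :-2] | padded[1:-1, 2:]
    )
    return np.argwhere((occupancy == FREE) & near_unknown)


def best_frontier(occupancy, value_map):
    """Pick the frontier cell with the highest value, or None if there is none."""
    frontiers = find_frontiers(occupancy)
    if len(frontiers) == 0:
        return None
    scores = value_map[frontiers[:, 0], frontiers[:, 1]]
    row, col = frontiers[int(np.argmax(scores))]
    return int(row), int(col)
```

In this sketch a frontier is any free cell with a 4-connected unknown neighbour, and the agent would head for the frontier whose value-map score is highest. A real system would typically cluster frontier cells and plan a path to the chosen one, which is omitted here.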

One Map to Find Them All: Real-time Open-Vocabulary Mapping for Zero-shot Multi-Object Navigation

OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models

MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation

ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

Object-aware Semantic Mapping of Indoor Scenes Using Octomap

Zero-shot Object Navigation with Vision-Language Foundation Models Reasoning

Object-Aware Hybrid Map for Indoor Robot Visual Semantic Navigation

MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation

Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

Balancing Performance and Efficiency in Zero-shot Robotic Navigation

Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill

Object-Oriented Semantic Mapping for Reliable UAVs Navigation

ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation

GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

Real-Time Metric-Semantic Mapping for Autonomous Navigation in Outdoor Environments

VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation

Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation

Target-driven multi-input mapless robot navigation with deep reinforcement learning