Abstract: Object navigation (ObjectNav) requires an agent to navigate through unseen environments to find queried objects. Many previous methods rely on supervised or reinforcement learning and are trained on limited household datasets with closed-set objects. Two key challenges remain unsolved: understanding free-form natural language instructions that refer to open-set objects, and generalizing to new environments in a zero-shot manner. To address both challenges, we propose OpenFMNav, an Open-set Foundation Model based framework for zero-shot object Navigation. We first use the reasoning abilities of large language models (LLMs) to extract, from natural language instructions, proposed objects that meet the user's demand. We then leverage the generalizability of large vision language models (VLMs) to actively discover and detect candidate objects in the scene, building a Versatile Semantic Score Map (VSSM). By conducting common sense reasoning on the VSSM, our method performs effective language-guided exploration and exploitation of the scene and finally reaches the goal. Because it builds on the reasoning and generalization abilities of foundation models, our method can understand free-form human instructions and perform effective open-set zero-shot navigation in diverse environments. Extensive experiments on the HM3D ObjectNav benchmark show that our method surpasses all the strong baselines on all metrics, demonstrating its effectiveness. Furthermore, we perform real robot demonstrations to validate our method's open-set capability and its generalizability to real-world environments.
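The exploration and exploitation loop summarized in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `ScoreMap` class, the max-confidence fusion rule, the `relevance` weights (standing in for LLM common sense reasoning), and the `threshold` and `radius` parameters are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreMap:
    """Toy top-down grid storing one confidence score per (cell, label)."""
    width: int
    height: int
    scores: dict = field(default_factory=dict)  # (x, y) -> {label: confidence}

    def update(self, x, y, label, confidence):
        """Fuse a detection into the map, keeping the highest confidence seen."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            return  # ignore detections projected outside the map
        cell = self.scores.setdefault((x, y), {})
        cell[label] = max(cell.get(label, 0.0), confidence)

    def best_cell(self, label, threshold):
        """Return the highest-scoring cell for `label`, or None if below threshold."""
        best, best_score = None, threshold
        for cell, labels in self.scores.items():
            score = labels.get(label, 0.0)
            if score >= best_score:
                best, best_score = cell, score
        return best


def frontier_value(score_map, frontier, relevance, radius=2):
    """Sum relevance-weighted scores of mapped objects near a frontier cell."""
    fx, fy = frontier
    total = 0.0
    for (x, y), labels in score_map.scores.items():
        if abs(x - fx) <= radius and abs(y - fy) <= radius:
            for label, conf in labels.items():
                total += relevance.get(label, 0.0) * conf
    return total


def choose_target(score_map, goal, frontiers, relevance, threshold=0.5):
    """Exploit a confident goal detection if one exists, otherwise explore."""
    goal_cell = score_map.best_cell(goal, threshold)
    if goal_cell is not None:
        return "exploit", goal_cell
    best = max(frontiers, key=lambda f: frontier_value(score_map, f, relevance))
    return "explore", best


# Hypothetical usage: the goal "tv" has not been seen yet, so the agent
# explores the frontier nearest to objects judged relevant to a TV.
demo_map = ScoreMap(10, 10)
demo_map.update(1, 1, "sofa", 0.9)
demo_map.update(8, 8, "sink", 0.8)
demo_relevance = {"sofa": 0.9, "sink": 0.1}  # assumed LLM-style relevance to "tv"
print(choose_target(demo_map, "tv", [(2, 2), (7, 7)], demo_relevance))
```

In this sketch the agent switches from exploring to exploiting as soon as a goal detection clears the threshold. A real system would add depth projection, path planning, and an actual LLM or VLM query, none of which are modeled here.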

ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation

ChatNav: Leveraging LLM to Zero-shot Semantic Reasoning in Object Navigation

Zero-shot Object Navigation with Vision-Language Foundation Models Reasoning

OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models

SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation

ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

Think Holistically, Act Down-to-Earth: A Semantic Navigation Strategy with Continuous Environmental Representation and Multi-step Forward Planning

Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill

VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

Embodied Contrastive Learning with Geometric Consistency and Behavioral Awareness for Object Navigation

CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration

Zero-shot Object Navigation with Vision-Language Models Reasoning

Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation

Object Goal Navigation using Goal-Oriented Semantic Exploration

Semantic Policy Network for Zero-Shot Object Goal Visual Navigation

One Map to Find Them All: Real-time Open-Vocabulary Mapping for Zero-shot Multi-Object Navigation

GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation

Two-Stage Depth Enhanced Learning with Obstacle Map For Object Navigation