Abstract: Object Goal Navigation (ObjectNav) is the task in which an agent must navigate to an instance of a specified object category in an unseen environment, using visual observations, within a limited number of time steps. This task plays a significant role in locating specific items in indoor spaces more efficiently, in assisting individuals with various tasks, and in supporting people with disabilities. Efficient ObjectNav in unfamiliar environments requires global perception and an understanding of the spatial and semantic regularities of environment layouts. In this work, we propose an explicit-prediction method called VLAI that uses visual-language alignment information to guide the agent's exploration. Previous navigation methods based on frontier potential prediction or egocentric map completion leverage only visual observations to construct semantic maps, and thus fail to help the agent develop a better global perception. Specifically, when predicting long-term goals, we retrieve previously saved visual observations to obtain visual information around the frontiers, based on their positions on the incrementally built, incomplete semantic map. We then apply our Chat Describer to this visual information to obtain detailed descriptions of the objects near each frontier. The Chat Describer is a novel automatic-questioning approach to visual-to-language conversion, composed of a Large Language Model (LLM) and a visual-to-language model (VLM) with visual question-answering functionality. In addition, we compute the semantic similarity between the target object category and the frontier object categories. Finally, by combining the semantic similarity with the frontier descriptions, the agent can predict long-term goals more accurately. Our experiments on the Gibson and HM3D datasets show that VLAI yields significantly better results than earlier methods.
The code is released at https://github.com/31539lab/VLAI .
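
The final step described in the abstract, combining semantic similarity with frontier descriptions to choose a long-term goal, can be illustrated with a minimal sketch. This is not the released VLAI implementation: the `Frontier` class, the similarity lookup table, the keyword-based `description_score`, and the weighted sum with `alpha` are all hypothetical stand-ins, since the abstract does not specify how the two signals are computed or combined.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Frontier:
    """Hypothetical container for one frontier on the semantic map."""
    position: Tuple[int, int]   # map cell of the frontier
    categories: List[str]       # object categories observed near it
    description: str            # text a describer might return for it


def semantic_similarity(target: str, category: str,
                        table: Dict[Tuple[str, str], float]) -> float:
    """Look up a similarity in [0, 1]; the table stands in for a language model."""
    if target == category:
        return 1.0
    return table.get((target, category), table.get((category, target), 0.0))


def description_score(target: str, description: str) -> float:
    """Toy assumption: 1.0 if the description mentions the target, else 0.0."""
    return 1.0 if target.lower() in description.lower() else 0.0


def score_frontier(frontier: Frontier, target: str,
                   table: Dict[Tuple[str, str], float],
                   alpha: float = 0.5) -> float:
    """Weighted sum of both signals; the weighting scheme is an assumption."""
    sim = max(
        (semantic_similarity(target, c, table) for c in frontier.categories),
        default=0.0,
    )
    desc = description_score(target, frontier.description)
    return alpha * sim + (1.0 - alpha) * desc


def select_long_term_goal(frontiers: List[Frontier], target: str,
                          table: Dict[Tuple[str, str], float],
                          alpha: float = 0.5) -> Frontier:
    """Return the frontier with the highest combined score."""
    if not frontiers:
        raise ValueError("no frontiers available")
    return max(frontiers, key=lambda f: score_frontier(f, target, table, alpha))
```

With a target of "pillow", a frontier near a bed whose description mentions a pillow would outrank a frontier near a sofa and a TV under this scoring, so the agent would head toward the bedroom first.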

ChatNav: Leveraging LLM to Zero-shot Semantic Reasoning in Object Navigation

OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models

VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation

VLAI: Exploration and exploitation based on visual-language aligned information for robotic object goal navigation

L3MVN: Leveraging Large Language Models for Visual Target Navigation

Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs

ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

Zero-shot Object Navigation with Vision-Language Models Reasoning

Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions

Intelligent LiDAR Navigation: Leveraging External Information and Semantic Maps with LLM as Copilot

Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation

$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

Advancing Object Goal Navigation Through LLM-enhanced Object Affinities Transfer

CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation

Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation

VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models

Navigation with VLM framework: Go to Any Language