Abstract: Object Goal Navigation (ObjectNav) is the task in which an agent must navigate to an instance of a specified object category in an unseen environment, using visual observations, within a limited number of time steps. This task plays a significant role in locating specific items in indoor spaces efficiently, assisting individuals in completing various tasks, and supporting people with disabilities. Efficient ObjectNav in unfamiliar environments requires global perception and an understanding of the spatial and semantic regularities of the environment layout. In this work, we propose VLAI, an explicit-prediction method that uses visual-language alignment information to guide the agent's exploration. Previous navigation methods based on frontier potential prediction or egocentric map completion leverage only visual observations to construct semantic maps, and therefore fail to help the agent develop a better global perception. Specifically, when predicting long-term goals, we retrieve previously saved visual observations to obtain visual information around the frontiers, based on their positions on the incrementally built, incomplete semantic map. We then apply our Chat Describer to this visual information to obtain detailed descriptions of the objects near each frontier. The Chat Describer is a novel automatic-questioning approach to visual-to-language conversion, composed of a Large Language Model (LLM) and a visual-to-language model (VLM) with visual question-answering functionality. In addition, we compute the semantic similarity between the target object category and the frontier object categories. Finally, by combining the semantic similarity with the frontier descriptions, the agent can predict long-term goals more accurately. Our experiments on the Gibson and HM3D datasets show that VLAI yields significantly better results than earlier methods.
The code is released at https://github.com/31539lab/VLAI .
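
The abstract's final step, combining semantic similarity with frontier descriptions to choose a long-term goal, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Frontier` class, the toy embedding table, the `description_score` field (standing in for whatever relevance signal is derived from the Chat Describer output), and the weighted-sum rule with weight `alpha` are all assumptions made for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Frontier:
    """Hypothetical frontier record; field names are illustrative assumptions."""
    position: Tuple[int, int]        # cell on the partial semantic map
    nearby_categories: List[str]     # object categories observed near the frontier
    description_score: float         # assumed relevance in [0, 1] derived from a text description


# Toy category embeddings with made-up values, standing in for a real language model.
TOY_EMBEDDINGS: Dict[str, Tuple[float, ...]] = {
    "bed": (0.9, 0.1, 0.0),
    "pillow": (0.8, 0.2, 0.1),
    "oven": (0.0, 0.9, 0.2),
    "sink": (0.1, 0.8, 0.3),
}


def cosine(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_similarity(target: str, categories: List[str]) -> float:
    """Best similarity between the target and any known nearby category."""
    target_vec = TOY_EMBEDDINGS[target]
    scores = [
        cosine(target_vec, TOY_EMBEDDINGS[c])
        for c in categories
        if c in TOY_EMBEDDINGS
    ]
    return max(scores, default=0.0)


def score_frontier(target: str, frontier: Frontier, alpha: float = 0.5) -> float:
    """Weighted sum of the two cues (an assumed combination rule)."""
    sim = semantic_similarity(target, frontier.nearby_categories)
    return alpha * sim + (1.0 - alpha) * frontier.description_score


def select_long_term_goal(
    target: str, frontiers: List[Frontier], alpha: float = 0.5
) -> Frontier:
    """Pick the frontier with the highest combined score."""
    return max(frontiers, key=lambda f: score_frontier(target, f, alpha))
```

For example, with the target `"bed"`, a frontier near a pillow whose description is judged relevant would outrank a frontier near an oven and a sink. In practice the embeddings and description scores would come from the language and vision-language models, and the combination rule may differ from the simple weighted sum assumed here.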

Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation

An Environmental Perception and Navigational Assistance System for Visually Impaired Persons Based on Semantic Stixels and Sound Interaction

Learning Navigational Visual Representations with Semantic Map Supervision

Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation

Sound Adversarial Audio-Visual Navigation

Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations

Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

Audio Visual Language Maps for Robot Navigation

A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation

Visual Representations for Semantic Target Driven Navigation

StereoNavNet: Learning to Navigate using Stereo Cameras with Auxiliary Occupancy Voxels

Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation

Vision-and-Language Navigation via Latent Semantic Alignment Learning

Pay Self-Attention to Audio-Visual Navigation

VLAI: Exploration and exploitation based on visual-language aligned information for robotic object goal navigation

Visual Perception Generalization for Vision-and-Language Navigation via Meta-Learning

Vision and Language Navigation in the Real World via Online Visual Language Mapping

Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning

Towards Versatile Embodied Navigation