VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model

Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, Chang Liu
2024-02-06
Abstract: In the realm of household robotics, the Zero-Shot Object Navigation (ZSON) task empowers agents to adeptly traverse unfamiliar environments and locate objects from novel categories without prior explicit training. This paper introduces VoroNav, a novel semantic exploration framework that proposes the Reduced Voronoi Graph to extract exploratory paths and planning nodes from a semantic map constructed in real time. By harnessing topological and semantic information, VoroNav designs text-based descriptions of paths and images that are readily interpretable by a large language model (LLM). In particular, our approach presents a synergy of path and farsight descriptions to represent the environmental context, enabling the LLM to apply commonsense reasoning to ascertain waypoints for navigation. Extensive evaluation on HM3D and HSSD validates that VoroNav surpasses existing benchmarks in both success rate and exploration efficiency (absolute improvement: +2.8% Success and +3.7% SPL on HM3D, +2.6% Success and +3.8% SPL on HSSD). Additionally introduced metrics that evaluate obstacle avoidance proficiency and perceptual efficiency further corroborate the enhancements achieved by our method in ZSON planning. Project page:
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses zero-shot object navigation (ZSON) for household robots: the robot must find target objects from novel categories in unfamiliar, previously unseen environments without explicit training on those categories. The core of the task is to use general commonsense knowledge to guide efficient exploration at minimal movement cost while accurately locating the target. To this end, the paper proposes VoroNav, a semantic exploration framework that builds a Reduced Voronoi Graph to extract exploratory paths and planning nodes from a semantic map constructed in real time. VoroNav combines topological and semantic information to produce text descriptions of paths and images that a large language model (LLM) can interpret. In particular, it represents the environmental context through a combination of path and farsight descriptions, enabling the LLM to apply commonsense reasoning when choosing navigation waypoints. The main contributions of the paper are:
1. A Voronoi-based scene graph generation method for selecting waypoints that provide rich observational data to facilitate the subsequent planning process.
2. A scene representation prompt strategy that combines path and farsight descriptions to give the LLM a comprehensive scene description for analysis and evaluation.
3. A decision-making policy that trades off exploration, path efficiency, and commonsense inclination to generate reasonable actions.
4. State-of-the-art results on the representative datasets HM3D and HSSD, surpassing existing methods (+2.8% Success and +3.7% SPL on HM3D, +2.6% Success and +3.8% SPL on HSSD).
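The trade-off named in the third contribution can be illustrated with a minimal sketch. This is not the paper's actual policy: the `Candidate` fields, the linear weighted sum, the normalisation of path cost, and the weight names are all assumptions made for illustration. Each candidate planning node carries an exploration score, a path cost, and a commonsense score (standing in for an LLM-estimated likelihood that the target is nearby), and the highest-scoring node is chosen.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical planning node with three illustrative scores."""
    name: str
    exploration: float   # in [0, 1]; assumed share of unexplored area seen from the node
    path_cost: float     # travel distance from the agent to the node (e.g. metres)
    commonsense: float   # in [0, 1]; assumed LLM-estimated likelihood of the target nearby


def score(c, max_cost, w_explore=1.0, w_efficiency=1.0, w_commonsense=1.0):
    """Weighted sum of the three terms (an assumed form, not the paper's formula)."""
    # Turn cost into an efficiency term in [0, 1]: cheaper paths score higher.
    efficiency = 1.0 - c.path_cost / max_cost if max_cost > 0 else 1.0
    return (w_explore * c.exploration
            + w_efficiency * efficiency
            + w_commonsense * c.commonsense)


def choose_waypoint(candidates, **weights):
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("no candidate waypoints")
    max_cost = max(c.path_cost for c in candidates)
    return max(candidates, key=lambda c: score(c, max_cost, **weights))


if __name__ == "__main__":
    nodes = [
        Candidate("hallway", exploration=0.8, path_cost=2.0, commonsense=0.2),
        Candidate("kitchen door", exploration=0.5, path_cost=4.0, commonsense=0.9),
        Candidate("closet", exploration=0.1, path_cost=1.0, commonsense=0.1),
    ]
    print(choose_waypoint(nodes).name)                     # equal weights
    print(choose_waypoint(nodes, w_commonsense=2.0).name)  # favour commonsense
```

With equal weights the nearby, exploration-rich node wins; doubling the commonsense weight shifts the choice toward the node that the (assumed) LLM score favours, which is the kind of balance the contribution describes.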
With these components, VoroNav improves both navigation success rate and exploration efficiency, and the additional metrics introduced in the paper indicate gains in obstacle avoidance and perceptual efficiency as well.
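The reported results use Success and SPL. As a reference point, SPL (Success weighted by Path Length) is commonly defined as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is 1 on success and 0 otherwise, l_i is the shortest-path length to the goal, and p_i is the length of the path the agent actually took. The sketch below assumes that standard definition; the function name and the tuple layout are illustrative, and the paper's own evaluation code is not shown here.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a tuple (success, shortest_length, agent_length),
    a layout assumed here for illustration.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        if not success:
            continue  # failed episodes contribute 0
        denom = max(taken, shortest)
        # Guard against a zero-length episode where the agent starts at the goal.
        total += shortest / denom if denom > 0 else 1.0
    return total / len(episodes)


if __name__ == "__main__":
    runs = [
        (True, 5.0, 10.0),   # success, but path twice as long as optimal
        (True, 4.0, 4.0),    # success along the optimal path
        (False, 3.0, 6.0),   # failure
    ]
    print(spl(runs))
```

Because failed episodes score zero and successful ones are discounted by path inefficiency, SPL can never exceed the plain success rate, which is why the two numbers are usually reported together.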