Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models

Zhen Zhang, Anran Lin, Chun Wai Wong, Xiangyu Chu, Qi Dou, K. W. Samuel Au

2024-03-13

Abstract: This paper proposes an interactive navigation framework that uses large language and vision-language models to let robots navigate in environments with traversable obstacles. We use a large language model (GPT-3.5) and an open-set vision-language model (Grounding DINO) to create an action-aware costmap for effective path planning without fine-tuning. With the large models, we achieve an end-to-end system that goes from textual instructions like "Can you pass through the curtains to deliver medicines to me?" to bounding boxes (e.g., curtains) with action-aware attributes. These bounding boxes are used to segment LiDAR point clouds into traversable and untraversable parts, and an action-aware costmap is then constructed to generate a feasible path. The pre-trained large models generalize well and require no additional annotated training data, allowing fast deployment in interactive navigation tasks. We verify the framework on multiple traversable objects, such as curtains and grass, by instructing the robot to traverse them. We also tested curtain traversal in a medical scenario. All experimental results demonstrated the proposed framework's effectiveness and adaptability to diverse environments.
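The segmentation and costmap steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Box` and `Point` types, the assumption that each LiDAR point already carries its projected image coordinates `(u, v)`, and the cost values `SOFT_COST` and `LETHAL_COST` are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

SOFT_COST = 10     # assumed cost for a traversable obstacle (e.g. a curtain)
LETHAL_COST = 100  # assumed cost for an untraversable obstacle (e.g. a wall)


@dataclass
class Box:
    """Image-space bounding box with an action-aware attribute (hypothetical type)."""
    label: str
    traversable: bool  # assumed to come from the language model's reading of the instruction
    u_min: float
    v_min: float
    u_max: float
    v_max: float

    def contains(self, u: float, v: float) -> bool:
        return self.u_min <= u <= self.u_max and self.v_min <= v <= self.v_max


@dataclass
class Point:
    """LiDAR return: ground-plane position (x, y) in metres and its
    assumed projection (u, v) into the camera image."""
    x: float
    y: float
    u: float
    v: float


def segment_points(points: List[Point], boxes: List[Box]) -> Tuple[List[Point], List[Point]]:
    """Split points into (traversable, untraversable) parts.

    A point is traversable only if its image projection falls inside a box
    flagged as traversable; every other return is treated as an ordinary obstacle.
    """
    traversable, untraversable = [], []
    for p in points:
        hit = any(b.traversable and b.contains(p.u, p.v) for b in boxes)
        (traversable if hit else untraversable).append(p)
    return traversable, untraversable


def build_costmap(points: List[Point], boxes: List[Box],
                  width: int, height: int, resolution: float) -> List[List[int]]:
    """Rasterise the two point sets into a grid with different costs."""
    grid = [[0] * width for _ in range(height)]
    traversable, untraversable = segment_points(points, boxes)
    for pts, cost in ((traversable, SOFT_COST), (untraversable, LETHAL_COST)):
        for p in pts:
            col = math.floor(p.x / resolution)
            row = math.floor(p.y / resolution)
            if 0 <= row < height and 0 <= col < width:
                # Keep the highest cost seen in a cell.
                grid[row][col] = max(grid[row][col], cost)
    return grid


boxes = [Box("curtain", True, 100, 0, 200, 300)]
points = [Point(1.0, 1.0, 150, 100),   # projects inside the curtain box
          Point(2.0, 1.0, 400, 100)]   # projects outside any box
costmap = build_costmap(points, boxes, width=4, height=3, resolution=1.0)
```

In this toy setup the curtain point lands in a low-cost cell and the other point in a lethal cell, so a planner can tell the two kinds of obstacle apart.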

Robotics, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses interactive robot navigation in environments that contain traversable obstacles, such as curtains and grass. It proposes an interactive navigation framework based on a large language model (GPT-3.5) and a vision-language model (Grounding DINO), enabling a robot to plan feasible paths from natural-language instructions (for example, "Can you pass through the curtains to deliver medicines to me?") and to navigate in these environments. Traditional navigation systems usually treat all obstacles as non-traversable, which limits the robot's flexibility and adaptability. By introducing action-aware attributes, the framework distinguishes traversable from non-traversable obstacles, improving the robot's ability to navigate in complex environments.

The main contributions of the paper are:
1. An interactive navigation framework based on pre-trained large models that enables robots to plan feasible paths in environments containing traversable objects.
2. Extraction of action-aware attributes, in addition to landmarks, from text instructions to assist sensor-data segmentation and the construction of action-aware costmaps.
3. Experimental verification of the framework's effectiveness and generalization across different traversable objects and scenarios.
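To illustrate the contrast between a traditional costmap and an action-aware one, here is a small sketch in which a curtain spans the only doorway in a wall. It uses plain Dijkstra search on a 4-connected grid, which is a stand-in chosen for the example and not necessarily the planner used in the paper. The function name `plan` and the cost values are hypothetical.

```python
import heapq
from typing import List, Optional, Tuple

LETHAL = 100  # assumed threshold: cells at or above this cost are never entered

Cell = Tuple[int, int]


def plan(costmap: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Dijkstra search over a 4-connected grid; returns a cell path or None."""
    rows, cols = len(costmap), len(costmap[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            # Walk the predecessor links back to the start.
            path = [cell]
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return path[::-1]
        if d > dist[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and costmap[nr][nc] < LETHAL:
                nd = d + 1 + costmap[nr][nc]  # step cost plus cell cost
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None


# Column 2 is a wall; the middle cell of that column is a curtain.
traditional = [[0, 0, 100, 0, 0],
               [0, 0, 100, 0, 0],   # curtain treated as a lethal obstacle
               [0, 0, 100, 0, 0]]
action_aware = [[0, 0, 100, 0, 0],
                [0, 0, 10, 0, 0],   # curtain given a low, passable cost
                [0, 0, 100, 0, 0]]

blocked = plan(traditional, (1, 0), (1, 4))    # no route exists
through = plan(action_aware, (1, 0), (1, 4))   # route passes through the curtain cell
```

With the traditional map the goal is unreachable, while the action-aware map yields a path through the curtain cell at `(1, 2)`. The low but nonzero curtain cost means the planner would still prefer a fully free route if one existed.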
