Abstract: This work explores the capacity of large language models (LLMs) to address problems at the intersection of spatial planning and natural language interfaces for navigation. We focus on following complex instructions that are more akin to natural conversation than the explicit procedural directives typically seen in robotics. Unlike most prior work, where navigation directives are provided as simple imperative commands (e.g., "go to the fridge"), we examine implicit directives obtained through conversational interactions. We leverage the 3D simulator AI2Thor to create household query scenarios at scale, and augment it by adding complex language queries for 40 object types. We demonstrate that a robot using our method, CARTIER (Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots), can parse descriptive language queries up to 42% more reliably than existing LLM-enabled methods by exploiting the ability of LLMs to interpret the user interaction in the context of the objects in the scenario.

What problem does this paper attempt to address?

The paper addresses the problem of enabling robots to understand and execute complex natural language instructions, particularly the implicit directives that appear in everyday conversation. Rather than simple commands (such as "go to the fridge"), it focuses on conversational instructions expressed through multi-turn dialogues or descriptive sentences.

To solve this problem, the authors developed a method called CARTIER (Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots). The method uses large language models (LLMs) to parse these complex language queries and convert them into specific navigation instructions. CARTIER first uses a pre-trained object detector to identify objects in the environment. It then combines this information with the user's query and uses an LLM to infer the user's true intent and determine the specific location the robot should go to.

To validate CARTIER, the authors created a series of scenarios in an extended AI2Thor simulation environment covering 40 object types and 3 query types (explicit, implicit, and conversational). In these experiments CARTIER outperformed the other methods on complex queries, most notably on conversational queries; the abstract reports that it parses descriptive language queries up to 42% more reliably than existing LLM-enabled methods. The paper also presents a real-world deployment in which CARTIER navigates to the target location based on a user's conversational instructions. Together, these results indicate that CARTIER can improve a robot's ability to understand and execute natural language instructions, and with it the quality of human-robot interaction.
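The detect-then-reason pipeline described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `DetectedObject`, `build_prompt`, and `resolve_goal`, the prompt wording, and the coordinate format are all hypothetical, and the LLM is passed in as a plain callable so that no particular model or API is assumed. A stub stands in for the model here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class DetectedObject:
    """One detection from an object detector (hypothetical format)."""
    label: str                     # object type, e.g. "Fridge"
    position: Tuple[float, float]  # assumed (x, y) map coordinates


def build_prompt(objects: List[DetectedObject], query: str) -> str:
    """Combine the detected object labels with the user's query."""
    labels = sorted({obj.label for obj in objects})
    return (
        "Objects in the scene: " + ", ".join(labels) + "\n"
        "User said: " + query + "\n"
        "Answer with the single object the robot should navigate to."
    )


def resolve_goal(
    objects: List[DetectedObject],
    query: str,
    llm: Callable[[str], str],
) -> Optional[DetectedObject]:
    """Ask the LLM which object the user means and return its detection.

    Returns None when the answer names no detected object.
    """
    answer = llm(build_prompt(objects, query)).strip().lower()
    # Prefer an exact label match, then fall back to a substring match.
    for obj in objects:
        if obj.label.lower() == answer:
            return obj
    for obj in objects:
        if obj.label.lower() in answer:
            return obj
    return None


def stub_llm(prompt: str) -> str:
    """Placeholder for a real model: always answers "Fridge"."""
    return "Fridge"


scene = [
    DetectedObject("Sofa", (1.0, 2.0)),
    DetectedObject("Fridge", (4.5, 0.5)),
]
goal = resolve_goal(scene, "I skipped lunch and I'm starving.", stub_llm)
print(goal)
```

Keeping the model behind a callable means the same function can be exercised with a stub, as here, or with any real LLM client. The returned position would then be handed to whatever navigation stack the robot uses, which this sketch leaves out.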

CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots

Verifiably Following Complex Robot Instructions with Foundation Models

Constrained Robotic Navigation on Preferred Terrains Using LLMs and Speech Instruction: Exploiting the Power of Adverbs

NARRATE: Versatile Language Architecture for Optimal Control in Robotics

Using Language to Generate State Abstractions for Long-Range Planning in Outdoor Environments

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation using Large Language Models

Intelligent LiDAR Navigation: Leveraging External Information and Semantic Maps with LLM as Copilot

Embodied AI in Mobile Robots: Coverage Path Planning with Large Language Models

Open-vocabulary Queryable Scene Representations for Real World Planning

Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework

Speech-Guided Sequential Planning for Autonomous Navigation using Large Language Model Meta AI 3 (Llama3)

Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models

Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation

L3MVN: Leveraging Large Language Models for Visual Target Navigation

Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs

Lana: A Language-Capable Navigator for Instruction Following and Generation

Task and Motion Planning with Large Language Models for Object Rearrangement

Pragmatic Instruction Following and Goal Assistance via Cooperative Language-Guided Inverse Planning