Abstract: Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation tasks, when given access to only object detection and segmentation vision models. We designed a single, task-agnostic prompt, without any in-context examples, motion primitives, or external trajectory optimisers. Then we studied how well it can perform across 30 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and we investigated which design choices in this prompt are the most important. Our conclusions raise the assumed limit of LLMs for robotics, and we reveal for the first time that LLMs do indeed possess an understanding of low-level robot control sufficient for a range of common tasks, and that they can additionally detect failures and then re-plan trajectories accordingly. Videos, prompts, and code are available at: https://www.robot-learning.uk/language-models-trajectory-generators

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores whether large language models (LLMs) can directly generate the dense end-effector trajectories required for robotic manipulation tasks without pre-trained skills, motion primitives, trajectory optimizers, or in-context examples. Specifically, the authors designed a single task-agnostic prompt to evaluate whether an LLM (GPT-4) can generate these trajectories using only object detection and segmentation vision models, and replan trajectories when task execution fails.

### Background and Motivation

In recent years, LLMs have attracted widespread attention for their ability to reason about everyday tasks. Although LLMs excel at high-level task planning, they are generally not considered capable of low-level control. Existing methods typically rely on pre-trained skills, motion primitives, trajectory optimizers, and a large number of in-context examples. However, the assumption that LLMs cannot perform low-level control has not been thoroughly tested. This paper therefore aims to verify whether LLMs possess sufficient low-level control knowledge to generate dense end-effector trajectories in zero-shot scenarios.

### Main Contributions

1. **First Demonstration**: Given only off-the-shelf object detection and segmentation models, a pre-trained LLM (GPT-4) can generate dense end-effector trajectories without pre-trained skills, motion primitives, trajectory optimizers, or in-context examples.
2. **Ablation Study**: Through a series of ablation experiments, the study reveals which techniques and prompt strategies lead to these capabilities.
3. **Task Failure Detection and Replanning**: The study investigates how LLMs can detect task failures by analyzing object trajectories in images and subsequently replan alternative trajectories.

### Experimental Setup

- **Assumptions and Constraints**:
  - No use of pre-stored motion primitives, policies, or trajectory optimizers.
  - No use of in-context examples.
  - LLMs can query pre-trained vision models for scene information but must autonomously generate, parse, and interpret inputs and outputs.
  - No additional pre-training or fine-tuning on robot-specific data.
- **Experimental Equipment**:
  - A Sawyer robot with a Robotiq 2F-85 gripper.
  - Two Intel RealSense D435 RGB-D cameras, one mounted on the robot's wrist and the other fixed on a tripod.
- **Task Selection**:
  - 30 everyday manipulation tasks were selected from recent robotics papers, covering a variety of representative tabletop robotic behaviors.

### Experimental Results

- **Success Rate of 57.3%**: Across the 30 tasks, the LLM with the full prompt achieved an average success rate of 57.3% without replanning, which increased to 63.8% when replanning was allowed.
- **Ablation Study**: Removing different parts of the prompt showed that step-by-step reasoning, code generation, and collision avoidance strategies significantly impacted task success rates.

### Conclusion

This study demonstrates that LLMs possess sufficient low-level control knowledge to generate dense end-effector trajectories in zero-shot scenarios, and that they can detect task failures and replan. This provides a new, intuitive, and flexible approach to robotic manipulation, reducing the need for human time and supervision.
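To make the phrase "dense end-effector trajectory" concrete, the following is a minimal Python sketch of the kind of waypoint-generating code an LLM could plausibly write for a top-down approach to a detected object. It is an illustration only, not the paper's prompt or actual model output: the `Pose` type, the function names, the 10 cm approach height, and the use of plain linear interpolation are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pose:
    """Hypothetical end-effector pose: position in metres, yaw in radians."""
    x: float
    y: float
    z: float
    yaw: float


def interpolate(start: Pose, end: Pose, steps: int) -> List[Pose]:
    """Return steps + 1 evenly spaced poses from start to end (inclusive).

    Yaw is interpolated naively, with no wrap-around handling.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    poses = []
    for i in range(steps + 1):
        t = i / steps
        poses.append(Pose(
            start.x + t * (end.x - start.x),
            start.y + t * (end.y - start.y),
            start.z + t * (end.z - start.z),
            start.yaw + t * (end.yaw - start.yaw),
        ))
    return poses


def top_down_approach(current: Pose,
                      object_xyz: Tuple[float, float, float],
                      approach_height: float = 0.10,
                      steps_per_segment: int = 20) -> List[Pose]:
    """Dense trajectory: move above the object, then descend straight down.

    object_xyz stands in for a position returned by a vision model;
    approach_height is an assumed clearance above the object.
    """
    ox, oy, oz = object_xyz
    above = Pose(ox, oy, oz + approach_height, current.yaw)
    grasp = Pose(ox, oy, oz, current.yaw)
    move = interpolate(current, above, steps_per_segment)
    descend = interpolate(above, grasp, steps_per_segment)
    # Drop the duplicated "above" pose where the two segments join.
    return move + descend[1:]


if __name__ == "__main__":
    start = Pose(0.0, 0.0, 0.30, 0.0)
    trajectory = top_down_approach(start, (0.20, 0.10, 0.02), steps_per_segment=10)
    print(len(trajectory), trajectory[-1])
```

In this sketch the trajectory is "dense" in the sense that every intermediate pose is listed explicitly rather than delegated to a motion primitive or an external trajectory optimiser, which is the constraint the summary describes.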

Language Models as Zero-Shot Trajectory Generators

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Large Language Models as General Pattern Machines

Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks

Decision-Making in Robotic Grasping with Large Language Models

Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models

Prompt a Robot to Walk with Large Language Models

Large Language Models as Zero-Shot Human Models for Human-Robot Interaction

Large Language Models for Robotics: Opportunities, Challenges, and Perspectives

Language-Driven Closed-Loop Grasping with Model-Predictive Trajectory Replanning

A Smart Interactive Camera Robot Based on Large Language Models

GG-LLM: Geometrically Grounding Large Language Models for Zero-shot Human Activity Forecasting in Human-Aware Task Planning

Generative Expressive Robot Behaviors using Large Language Models

Grounding Language Models in Autonomous Loco-manipulation Tasks

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

Language models are robotic planners: reframing plans as goal refinement graphs

Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents

Generalizable Long-Horizon Manipulations with Large Language Models

Language to Rewards for Robotic Skill Synthesis