Abstract: Large language models (LLMs) have shown remarkable success in assisting robot learning tasks, e.g., complex household planning. However, the performance of pretrained LLMs relies heavily on domain-specific templated text data, which may be unavailable in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the capability to evolve through non-expert interactions with environments. In this work, we introduce a novel learning paradigm that generates robots' executable actions in the form of text, derived solely from visual observations, using language-based summarization of these observations as the bridge between the two domains. Our proposed paradigm stands apart from previous works, which used either language instructions or a combination of language and visual data as inputs. Moreover, our method does not require oracle text summarization of the scene, eliminating the need for human involvement in the learning loop, which makes it more practical for real-world robot learning tasks. Our proposed paradigm consists of two modules: the SUM module, which interprets the environment using visual observations and produces a text summary of the scene, and the APM module, which generates executable action policies based on the natural language descriptions provided by the SUM module. We demonstrate that our proposed method can be fine-tuned with two strategies, imitation learning and reinforcement learning, to adapt effectively to the target test tasks. We conduct extensive experiments involving various SUM/APM model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. Our experimental results demonstrate that our method surpasses existing baselines, confirming the effectiveness of this novel learning paradigm.
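
The two-module structure described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the class name `SumApmPipeline`, the type aliases, the toy stand-ins `toy_sum` and `toy_apm`, and the action strings are hypothetical and are not taken from the paper, which would use a learned image-to-text model for SUM and a language model for APM.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type aliases; the abstract does not specify data formats.
Image = Sequence[Sequence[int]]        # stand-in for a visual observation
Summarizer = Callable[[Image], str]    # SUM: image -> text summary
ActionPolicy = Callable[[str], str]    # APM: text summary -> action as text


@dataclass
class SumApmPipeline:
    """Chains a SUM module and an APM module, with text as the bridge."""
    sum_module: Summarizer
    apm_module: ActionPolicy

    def act(self, observation: Image) -> str:
        summary = self.sum_module(observation)  # vision -> language
        return self.apm_module(summary)         # language -> action text


def toy_sum(observation: Image) -> str:
    # Placeholder for a learned captioning model (assumption):
    # reports an apple whenever any pixel is nonzero.
    has_object = any(pixel for row in observation for pixel in row)
    return "an apple is on the table" if has_object else "the table is empty"


def toy_apm(summary: str) -> str:
    # Placeholder for a language-model policy (assumption):
    # emits an illustrative action string based on the summary.
    return "[grab] <apple>" if "apple" in summary else "[walk] <kitchen>"


pipeline = SumApmPipeline(sum_module=toy_sum, apm_module=toy_apm)
action = pipeline.act([[0, 1], [0, 0]])
```

Because the only interface between the modules is a string, either placeholder could be swapped for a different model without changing the other, which is the kind of SUM/APM model selection the abstract mentions.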

A Modular Framework for Robot Embodied Instruction Following by Large Language Model

Modular Framework for Visuomotor Language Grounding

CLFR-M: Continual Learning Framework for Robots Via Human Feedback and Dynamic Memory

Prompter: Utilizing Large Language Model Prompting for a Data Efficient Embodied Instruction Following

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

A Modular Vision Language Navigation and Manipulation Framework for Long Horizon Compositional Tasks in Indoor Environment

A learning framework for semantic reach-to-grasp tasks integrating machine learning and optimization

FILM: Following Instructions in Language with Modular Methods

Object-Centric Instruction Augmentation for Robotic Manipulation

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

Verifiably Following Complex Robot Instructions with Foundation Models

AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation

ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

Embodied Executable Policy Learning with Language-based Scene Summarization

Autonomous Improvement of Instruction Following Skills via Foundation Models

Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

SEAL: Semantic Frame Execution And Localization for Perceiving Afforded Robot Actions

MHRC: Closed-loop Decentralized Multi-Heterogeneous Robot Collaboration with Large Language Models

Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion