Abstract: Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3, with performance boosts of $2.04\times$, $1.54\times$, and $1.82\times$ across three different benchmarks, respectively. The resulting performance competes with or surpasses that of larger counterparts such as GPT-4.

What problem does this paper attempt to address?

The paper addresses the poor performance of large language models (LLMs) on basic physical reasoning and robotic tasks. Because LLMs lack direct experience with real-world physical details, they often make errors in these tasks. To overcome this, the authors propose GLIMO (Grounding Large language model with Imperfect world MOdel), which improves LLMs' performance on practical tasks by using proxy world models (such as simulators) to collect and synthesize training data.

### Main Issues:

1. **Insufficient Physical Reasoning Ability**: LLMs perform poorly in tasks that require understanding the physical environment, such as robotic operations and autonomous driving.
2. **Lack of Real-World Experience**: LLMs are primarily trained on text corpora and lack an understanding of real-world physical details.
3. **Insufficient Data Quality and Diversity**: Existing methods rely on manually designed prompts or tasks, making it difficult to generate high-quality and diverse data.
4. **Limited Generalization Ability**: Existing methods are effective in small-scale environments but perform poorly in open real-world tasks.

### Solution:

- **GLIMO Framework**: Utilizes proxy world models (such as simulators) to collect and synthesize training data, improving LLMs' physical reasoning and task execution abilities through automated data generation and annotation.
- **Self-Refining Mechanism**: Introduces an iterative self-refining module that keeps sampled experiences temporally consistent, and a retrieval-augmented generation module that reflects on prior experiences to improve data quality and diversity.
- **Multi-Task Training**: Improves LLMs' generalization ability and adaptability through various task templates and self-supervised learning.
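The data-generation loop described in the Solution items above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not GLIMO's implementation: the number-line simulator, the `Experience` type, the `refine` callback (a stand-in for an LLM revision step), the two instruction seeds, and the overlap-based `retrieve` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Experience:
    actions: List[str]  # actions taken in the proxy world model
    states: List[int]   # agent position after each action


def simulate(actions: List[str], start: int = 0) -> List[int]:
    """Hypothetical proxy world model: an agent moving on a number line."""
    delta = {"left": -1, "right": 1, "stay": 0}
    states, pos = [], start
    for action in actions:
        pos += delta[action]
        states.append(pos)
    return states


def self_refine(
    actions: List[str],
    draft_states: List[int],
    refine: Callable[[List[str], List[int]], List[int]],
    max_rounds: int = 3,
) -> Experience:
    """Revise a drafted state sequence until it agrees with the simulator."""
    states = draft_states
    for _ in range(max_rounds):
        if states == simulate(actions):  # temporal-consistency check
            return Experience(actions, states)
        states = refine(actions, states)
    # Assumption: fall back to the simulator's own rollout if refinement stalls.
    return Experience(actions, simulate(actions))


# Hypothetical question-answering instruction seeds.
SEEDS = [
    "After the actions {actions}, where is the agent?",
    "How many steps from the start is the agent after {actions}?",
]


def make_instructions(exp: Experience) -> List[Tuple[str, str]]:
    """Turn one verified experience into (question, answer) training pairs."""
    acts = ", ".join(exp.actions)
    final = exp.states[-1]
    return [
        (SEEDS[0].format(actions=acts), f"position {final}"),
        (SEEDS[1].format(actions=acts), str(abs(final))),
    ]


def retrieve(query: List[str], memory: List[Experience], k: int = 1) -> List[Experience]:
    """Return the k stored experiences sharing the most action types with the query."""
    return sorted(memory, key=lambda e: len(set(e.actions) & set(query)), reverse=True)[:k]


def stub_refiner(actions: List[str], states: List[int]) -> List[int]:
    # Stand-in for an LLM revision step; here it simply consults the simulator.
    return simulate(actions)


# The draft [1, 2, 3] is inconsistent with the rollout, so it gets refined.
exp = self_refine(["right", "right", "left"], [1, 2, 3], stub_refiner)
pairs = make_instructions(exp)
```

In this sketch the simulator is the only source of ground truth: a drafted experience is accepted only once it matches the rollout, and question-answer pairs are derived only from accepted experiences.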
### Experimental Results:

- **Performance Improvement**: GLIMO improves the performance of open-source LLMs (such as LLaMA-3 and OPT-13B), with gains of 2.04 times, 1.54 times, and 1.82 times across three benchmarks, respectively.
- **Surpassing Existing Models**: In the Agent World and Urban Driving environments, GLIMO outperforms the closed-source model GPT-4, with performance improvements of 51.7% and 84.5%, respectively.

### Conclusion:

GLIMO compensates for LLMs' deficiencies in physical reasoning and task execution by using proxy world models, enhancing LLMs' performance and generalization ability in real-world tasks. This approach provides new directions for future research, particularly in fields such as robotic learning and autonomous driving.
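The experimental results above mix multiplicative factors (2.04 times) with percentages (51.7%). A minimal sketch of how the two notations relate is given below. It assumes the percentages are relative improvements over the baseline score rather than absolute percentage points, which the summary does not state. The function names are hypothetical.

```python
def factor_to_percent_gain(factor: float) -> float:
    """A k-times score corresponds to a (k - 1) * 100 percent relative gain."""
    return (factor - 1.0) * 100.0


def percent_gain_to_factor(percent: float) -> float:
    """A p percent relative gain corresponds to a (1 + p / 100)-times score."""
    return 1.0 + percent / 100.0


# Factors reported for the three benchmarks.
for factor in (2.04, 1.54, 1.82):
    print(f"{factor:.2f}x = {factor_to_percent_gain(factor):.0f}% relative gain")

# Percentages reported against GPT-4, read as relative gains (assumption).
for percent in (51.7, 84.5):
    print(f"{percent}% relative gain = {percent_gain_to_factor(percent):.3f}x")
```

Under this reading, a 2.04-times result is a 104% relative gain over the base model, while a 51.7% gain over GPT-4 is a factor of about 1.52.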

Grounding Large Language Models In Embodied Environment With Imperfect World Models

Language Models Meet World Models: Embodied Experiences Enhance Language Models

Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning

Self-driven Grounding: Large Language Model Agents with Automatical Language-aligned Skill Learning

Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration

Large Language Models for Robotics: Opportunities, Challenges, and Perspectives

LLM+A: Grounding Large Language Models in Physical World with Affordance Prompting

How Well Do Large Language Models Truly Ground?

A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment

LEGENT: Open Platform for Embodied Agents

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

LLaMA Rider: Spurring Large Language Models to Explore the Open World

Integration of LLMs and the Physical World: Research and Application

Are Large Language Models Temporally Grounded?

Grounding Language with Visual Affordances over Unstructured Data

Grounding Multimodal Large Language Models in Actions

Making Large Language Models into World Models with Precondition and Effect Knowledge

Remember what you did so you know what to do next

LanGWM: Language Grounded World Model

Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning