Grounding Large Language Models In Embodied Environment With Imperfect World Models

Haolan Liu, Jishen Zhao
2024-10-04
Abstract: Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose Grounding Large language model with Imperfect world MOdel (GLIMO), which uses proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs such as LLaMA-3, with gains of 2.04$\times$, 1.54$\times$, and 1.82$\times$ across three different benchmarks, respectively. The resulting models compete with or surpass larger counterparts such as GPT-4.
Computation and Language, Machine Learning, Robotics
What problem does this paper attempt to address?
The paper addresses the poor performance of large language models (LLMs) on basic physical reasoning and robotic task execution. Because LLMs lack direct experience with real-world physical details, they often make errors on these tasks. The authors propose GLIMO (Grounding Large language model with Imperfect world MOdel), which improves LLM performance on such tasks by using proxy world models (such as simulators) to collect and synthesize training data.

### Main Issues:

1. **Insufficient Physical Reasoning Ability**: LLMs perform poorly in tasks that require understanding the physical environment, such as robotic operations and autonomous driving.
2. **Lack of Real-World Experience**: LLMs are primarily trained on text corpora and lack an understanding of real-world physical details.
3. **Insufficient Data Quality and Diversity**: Existing methods rely on manually designed prompts or tasks, making it difficult to generate high-quality and diverse data.
4. **Limited Generalization Ability**: Existing methods are effective in small-scale environments but perform poorly in open real-world tasks.

### Solution:

- **GLIMO Framework**: Uses proxy world models (such as simulators) to collect and synthesize training data, improving LLMs' physical reasoning and task execution through automated data generation and annotation.
- **Self-Refining Mechanism**: Introduces an iterative self-refining module that keeps sampled experiences temporally consistent, together with a retrieval-augmented generation module that reflects on prior experiences to improve data quality and diversity.
- **Multi-Task Training**: Improves LLMs' generalization ability and adaptability through various task templates and self-supervised learning.
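The three generator components named in the solution (temporally consistent experience sampling, question-answering instruction seeds, and retrieval over prior experiences) can be illustrated with a toy sketch. Everything below is hypothetical and is not the authors' implementation: the 1-D walker stands in for a simulator, a rule-based checker stands in for the LLM agent's self-refinement, and word overlap stands in for a real retriever.

```python
import random
from dataclasses import dataclass

DELTA = {"left": -1, "right": 1}


@dataclass
class Step:
    t: int
    action: str
    position: int  # position logged after the action


def sample_trajectory(n_steps, rng, glitch_prob=0.3):
    """Toy imperfect world model: a 1-D walker whose log is sometimes wrong."""
    pos, traj = 0, []
    for t in range(n_steps):
        action = rng.choice(sorted(DELTA))
        pos += DELTA[action]
        noise = rng.choice([-2, 2]) if rng.random() < glitch_prob else 0
        traj.append(Step(t, action, pos + noise))
    return traj


def first_violation(traj):
    """Return the index of the first step that contradicts its predecessor."""
    prev = 0
    for i, step in enumerate(traj):
        if step.position != prev + DELTA[step.action]:
            return i
        prev = step.position
    return None


def refine(traj):
    """Repair one violation per round until the trajectory is consistent."""
    traj = [Step(s.t, s.action, s.position) for s in traj]
    for _ in range(len(traj)):
        i = first_violation(traj)
        if i is None:
            break
        prev = traj[i - 1].position if i > 0 else 0
        traj[i].position = prev + DELTA[traj[i].action]
    return traj


# Hypothetical instruction seeds: a question template plus an answer extractor.
SEEDS = [
    ("Where is the agent after step {t}?", lambda s: str(s.position)),
    ("Which action did the agent take at step {t}?", lambda s: s.action),
]


def make_instructions(traj):
    """Expand every seed over every step of a consistent trajectory."""
    return [
        {"question": q.format(t=s.t), "answer": fn(s)}
        for s in traj
        for q, fn in SEEDS
    ]


def retrieve(memory, query, k=2):
    """Rank stored examples by word overlap with the query."""
    words = set(query.lower().split())
    overlap = lambda ex: len(words & set(ex["question"].lower().split()))
    return sorted(memory, key=overlap, reverse=True)[:k]


rng = random.Random(0)
raw = sample_trajectory(5, rng)
clean = refine(raw)
dataset = make_instructions(clean)
context = retrieve(dataset, "Which action did the agent take at step 3?")
```

The refinement loop fixes the earliest inconsistency in each round, so the first violating index moves strictly forward and the loop terminates within `len(traj)` rounds. In a real pipeline the retrieved `context` would be placed in the generator's prompt rather than returned directly.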
### Experimental Results:

- **Performance Improvement**: Experimental results show that GLIMO significantly improves the performance of multiple open-source LLMs (such as LLaMA-3, OPT-13B) on different benchmarks, with improvements of 2.04 times, 1.54 times, and 1.82 times, respectively.
- **Surpassing Existing Models**: In the Agent World and Urban Driving environments, GLIMO outperforms the current state-of-the-art closed-source model GPT-4, with performance improvements of 51.7% and 84.5%, respectively.

### Conclusion:

GLIMO compensates for LLMs' deficiencies in physical reasoning and task execution by using proxy world models, significantly enhancing LLMs' performance and generalization ability in real-world tasks. This approach provides new directions for future research, particularly in fields such as robotic learning and autonomous driving.
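The experimental results mix multiplicative factors (2.04 times) with percentage gains (51.7%). The short sketch below converts between the two forms so the figures can be compared on one scale. It assumes the percentages are relative gains over the baseline, which the summary does not state explicitly; the function names are illustrative.

```python
def factor_to_percent(factor: float) -> float:
    """A factor of f over the baseline is a relative gain of (f - 1) * 100 percent."""
    return round((factor - 1.0) * 100.0, 1)


def percent_to_factor(percent: float) -> float:
    """A relative gain of p percent is a factor of 1 + p / 100 over the baseline."""
    return round(1.0 + percent / 100.0, 3)


# Figures quoted in the experimental results.
factors = [2.04, 1.54, 1.82]
percents = [51.7, 84.5]

as_percent = [factor_to_percent(f) for f in factors]
as_factor = [percent_to_factor(p) for p in percents]
```

Read this way, a 2.04 times improvement corresponds to a 104% relative gain, and an 84.5% gain corresponds to a factor of 1.845.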