Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

Zihao Zhou, Bin Hu, Chenyang Zhao, Pu Zhang, Bin Liu
2024-05-27
Abstract: Recent studies have uncovered the potential of Large Language Models (LLMs) in addressing complex sequential decision-making tasks through the provision of high-level instructions. However, LLM-based agents lack specialization in tackling specific target problems, particularly in real-time dynamic environments. Additionally, deploying an LLM-based agent in practical scenarios can be both costly and time-consuming. On the other hand, reinforcement learning (RL) approaches train agents that specialize in the target task but often suffer from low sampling efficiency and high exploration costs. In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task. We conducted experiments on challenging MiniGrid and Habitat environments, specifically designed for embodied AI research, to evaluate the effectiveness of our framework. The results clearly demonstrate that our approach achieves superior performance compared to strong baseline methods. Our code is available at https://github.com/ZJLAB-AMMI/LLM4Teach.
Artificial Intelligence
What problem does this paper attempt to address?
The main aim of this paper is to address the limitations of large language models (LLMs) in executing specific tasks, particularly decision-making in real-time dynamic environments. Specifically, the paper proposes solutions to the following issues:

1. **LLMs' lack of task specialization**: Although LLMs can handle complex sequential decision-making tasks by providing high-level instructions, they lack specialization in specific target problems, especially in real-time dynamic environments.
2. **High deployment costs**: Using LLMs for decision-making often requires substantial computational resources, such as memory and power, making their deployment in practical applications very costly.
3. **Inefficient sampling in reinforcement learning (RL)**: Traditional RL methods often have low sampling efficiency in complex, high-dimensional environments, especially when reward signals are sparse, which makes learning slow and costly.

To address these issues, the authors propose a new framework called "LLM for Policy Teaching (LLM4Teach)." The core idea is to use a pre-trained LLM as a teacher agent that guides a lightweight student RL agent in quickly acquiring decision-making capabilities for a specific task. In the early stages, the student agent learns to imitate the teacher's behavior by minimizing the difference between its actions and those of the teacher. As learning progresses, the student agent gradually shifts from relying on the teacher to relying on environmental feedback. This shift is achieved by adjusting the weights of the teacher-guidance loss term and the traditional RL loss term in the learning objective.

The main contributions of this method include:

- Proposing a policy distillation method (LLM4Teach) to overcome the limitations of LLM-based and RL-based agents in embodied sequential decision-making.
- Demonstrating the effectiveness of the method through extensive experiments in challenging embodied environments, showing higher accuracy and lower computational burden compared to methods based solely on LLM or RL.
- Highlighting that LLMs can produce various types of erroneous decisions in embodied settings, and that LLM4Teach provides an effective way to mitigate or avoid the impact of these errors.

Additionally, the paper verifies that uncertainty-aware guidance from the LLM, rather than deterministic guidance, can improve the learning efficiency of the student agent.

In summary, this research aims to develop a new type of agent system that can learn quickly and solve problems efficiently by combining the powerful reasoning capabilities of LLMs with the effective learning mechanisms of RL.
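The shift from teacher guidance to environment feedback described above can be illustrated with a small sketch. It is not the authors' implementation: the function names (`kl_divergence`, `teacher_weight`, `combined_loss`), the use of a KL divergence as the "difference" between student and teacher, the direction of that divergence, and the linear decay schedule are all assumptions made for illustration. The summary only states that the two loss terms are reweighted over training.

```python
import math

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions over the same action set.

    Assumption: the student-teacher "difference" is measured with a KL term.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def teacher_weight(step, total_steps, initial=1.0):
    """Hypothetical linear schedule: full teacher weight at step 0, zero at the end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial * (1.0 - frac)

def combined_loss(rl_loss, student_probs, teacher_probs, step, total_steps):
    """RL loss plus a teacher-guidance term whose weight decays during training."""
    lam = teacher_weight(step, total_steps)
    return rl_loss + lam * kl_divergence(student_probs, teacher_probs)

# Uncertainty-aware guidance: the teacher spreads probability over actions.
soft_teacher = [0.7, 0.2, 0.1]
# Deterministic guidance: the teacher commits to a single action (one-hot).
hard_teacher = [1.0, 0.0, 0.0]

student = [0.4, 0.4, 0.2]
early = combined_loss(0.5, student, soft_teacher, step=0, total_steps=100)
late = combined_loss(0.5, student, soft_teacher, step=100, total_steps=100)
```

Under these assumptions, `early` is dominated by the imitation term, while `late` reduces to the plain RL loss because the teacher weight has decayed to zero. Passing `hard_teacher` instead of `soft_teacher` shows the deterministic case: a one-hot target heavily penalizes any student probability mass placed away from the teacher's chosen action.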