PWM: Policy Learning with Large World Models

Ignat Georgiev, Varun Giridhar, Nicklas Hansen, Animesh Garg
2024-07-03
Abstract: Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in multi-task settings with different embodiments. World models offer scalability by learning a simulation of the environment, yet they often rely on inefficient gradient-free optimization methods. We introduce Policy learning with large World Models (PWM), a novel model-based RL algorithm that learns continuous control policies from large multi-task world models. By pre-training the world model on offline data and using it for first-order gradient policy learning, PWM effectively solves tasks with up to 152 action dimensions and outperforms methods using ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without the need for expensive online planning. Visualizations and code available at https://www.imgeorgiev.com/pwm
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
The paper addresses efficient and robust reinforcement learning (RL) policy learning in multi-task settings using large world models. It proposes PWM (Policy learning with large World Models), a model-based RL algorithm that trains continuous control policies quickly from a pre-trained, large multi-task world model. According to the paper, PWM has the following advantages over existing methods:

1. **Handling high-dimensional action spaces**: PWM solves tasks with up to 152 action dimensions, which is very challenging for traditional RL methods.
2. **Improving rewards**: In both single-task and multi-task settings, PWM achieves higher rewards than existing baselines.
3. **Avoiding online planning**: PWM learns policies efficiently without the expensive online planning that other methods depend on.
4. **Scaling to many tasks**: In an 80-task setting, PWM achieves up to 27% higher rewards than existing baselines.

The paper validates PWM through a series of experiments and reports better robustness and generalization on contact-rich tasks. It also examines how the choice of activation function affects the world model and how batch size affects policy learning, which the summary presents as further support for PWM's practical advantages.
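
To make "first-order gradient policy learning" through a world model more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The 1-D linear model `s' = a*s + b*u`, the quadratic reward, the one-parameter policy `u = k*s`, and all constants are assumptions chosen for illustration, and the hand-picked model merely stands in for a learned, differentiable world model. The sketch only shows the core idea: differentiate the return of a model rollout with respect to the policy parameter via the chain rule, then take gradient ascent steps, with no online planning involved.

```python
def rollout_return_and_grad(k, s0=1.0, horizon=10, a=1.0, b=0.5, c=0.1):
    """Roll a toy world model forward and return (return, d return / d k).

    Hypothetical toy setup (not from the paper):
      world model: s_next = a * s + b * u
      reward:      r = -s^2 - c * u^2
      policy:      u = k * s   (single parameter k)
    The derivative is carried along the rollout with the chain rule.
    """
    s, ds_dk = s0, 0.0          # state and its sensitivity to k
    total, dtotal_dk = 0.0, 0.0
    for _ in range(horizon):
        # Policy and its derivative with respect to k.
        u = k * s
        du_dk = s + k * ds_dk
        # Reward and its derivative.
        total += -(s * s) - c * u * u
        dtotal_dk += -2.0 * s * ds_dk - 2.0 * c * u * du_dk
        # Differentiable model step and its derivative.
        s, ds_dk = a * s + b * u, a * ds_dk + b * du_dk
    return total, dtotal_dk


def train_policy(k=0.0, lr=0.01, steps=200):
    """Gradient ascent on the model-predicted return."""
    for _ in range(steps):
        _, grad = rollout_return_and_grad(k)
        k += lr * grad
    return k


if __name__ == "__main__":
    before, _ = rollout_return_and_grad(0.0)
    k_trained = train_policy()
    after, _ = rollout_return_and_grad(k_trained)
    print("return before:", before, "return after:", after, "k:", k_trained)
```

In practice an automatic differentiation library would compute these gradients through a neural world model and policy. The manual chain rule is used here only to keep the example dependency-free.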