PWM: Policy Learning with Large World Models

Ignat Georgiev, Varun Giridhar, Nicklas Hansen, Animesh Garg

2024-07-03

Abstract: Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in multi-task settings with different embodiments. World models offer scalability by learning a simulation of the environment, yet they often rely on inefficient gradient-free optimization methods. We introduce Policy learning with large World Models (PWM), a novel model-based RL algorithm that learns continuous control policies from large multi-task world models. By pre-training the world model on offline data and using it for first-order gradient policy learning, PWM effectively solves tasks with up to 152 action dimensions and outperforms methods using ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without the need for expensive online planning. Visualizations and code available at https://www.imgeorgiev.com/pwm

Machine Learning, Artificial Intelligence, Robotics

What problem does this paper attempt to address?

The paper addresses efficient and robust reinforcement learning (RL) policy learning in multi-task settings using large world models. It proposes PWM (Policy learning with large World Models), a model-based RL algorithm that trains continuous control policies from a pre-trained, large multi-task world model. Compared to existing methods, PWM has the following advantages:

1. **Handling high-dimensional action spaces**: PWM solves tasks with up to 152 action dimensions, which is very challenging for traditional RL methods.
2. **Improving rewards**: In both single-task and multi-task settings, PWM achieves higher rewards than existing baselines.
3. **Avoiding online planning**: Unlike methods that plan at decision time, PWM learns policies efficiently without expensive online planning.
4. **Scaling to many tasks**: In an 80-task setting, PWM achieves up to 27% higher rewards than existing baselines.

The paper validates PWM through a series of experiments, in particular showing better robustness and generalization on contact-rich tasks. It also examines how the choice of activation function affects the world model and how batch size affects policy learning, further supporting PWM's practical advantages.
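
To make "first-order gradient policy learning" through a world model concrete, here is a minimal toy sketch. It is not the PWM implementation: the scalar linear model, the quadratic reward, the linear policy `u = -k * s`, and all function names and constants are assumptions chosen for illustration. PWM itself uses a large neural world model; here the gradient of the return with respect to the policy parameter is obtained by differentiating through the model rollout by hand, which is the basic idea that contrasts with gradient-free optimization.

```python
# Toy illustration of first-order policy gradients through a differentiable model.
# Assumptions (not from the paper): scalar state, linear "world model",
# quadratic reward, and a linear policy with a single gain parameter k.

def rollout_return_and_grad(k, s0, a=0.9, b=0.5, horizon=16, action_cost=0.1):
    """Return (total_reward, d total_reward / d k) for the policy u = -k * s.

    Model (treated as already learned and frozen): s_next = a * s + b * u.
    Reward per step: r = -(s ** 2) - action_cost * u ** 2.
    The gradient is exact: state sensitivities are propagated through the model.
    """
    s = s0
    ds_dk = 0.0  # sensitivity of the current state to the policy gain k
    total, grad = 0.0, 0.0
    for _ in range(horizon):
        u = -k * s
        du_dk = -s - k * ds_dk  # product rule on u = -k * s
        total += -(s * s) - action_cost * u * u
        grad += -2.0 * s * ds_dk - 2.0 * action_cost * u * du_dk
        # Step the model and its sensitivity together.
        s, ds_dk = a * s + b * u, a * ds_dk + b * du_dk
    return total, grad


def train_policy(k=0.0, s0=1.0, lr=0.05, steps=200):
    """Gradient ascent on the model-predicted return; no online planning."""
    for _ in range(steps):
        _, grad = rollout_return_and_grad(k, s0)
        k += lr * grad
    return k


if __name__ == "__main__":
    before, _ = rollout_return_and_grad(0.0, 1.0)
    k_trained = train_policy()
    after, _ = rollout_return_and_grad(k_trained, 1.0)
    print(f"return before: {before:.3f}, after: {after:.3f}, gain: {k_trained:.3f}")
```

In a realistic setting the hand-written sensitivities would be replaced by automatic differentiation through a neural world model, and the policy would be a network rather than a single gain.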

Scaling Population-Based Reinforcement Learning with GPU Accelerated Simulation

Operator World Models for Reinforcement Learning

Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

Harmony World Models: Boosting Sample Efficiency for Model-based Reinforcement Learning

Learning Latent Dynamic Robust Representations for World Models

Gradient-based Planning with World Models

The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions

World Models Increase Autonomy in Reinforcement Learning

Masked World Models for Visual Control

Imagined Value Gradients: Model-Based Policy Optimization with Transferable Latent Dynamics Models

MAMBPO: Sample-efficient multi-robot reinforcement learning using learned world models

OMPO: A Unified Framework for RL under Policy and Dynamics Shifts

Smaller World Models for Reinforcement Learning

Efficient Imitation Learning with Conservative World Models

Reward-free World Models for Online Imitation Learning

Model-based Policy Optimization using Symbolic World Model

Expert Level control of Ramp Metering based on Multi-task Deep Reinforcement Learning

HarmonyDream: Task Harmonization Inside World Models

Hierarchical World Models as Visual Whole-Body Humanoid Controllers

Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks