Abstract: A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world models in conditional video generation, we explore the potential of image-observation-based world models for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games with 6 billion tokens of data to learn general-purpose representations and decision-making ability. Our method jointly optimizes a world-action model through a shared transformer backbone, which stabilizes temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q-value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales favorably with model capacity and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data (approximately 4 trajectories) per game, demonstrating superior generalization. We will release code and model weights at https://github.com/CJReinforce/JOWA.

Smaller World Models for Reinforcement Learning

Model-Based Reinforcement Learning for Atari

Recurrent World Models Facilitate Policy Evolution

Efficient World Models with Context-Aware Tokenization

Model-Based Bayesian Reinforcement Learning in Large Structured Domains

Reward-Free Curricula for Training Robust World Models

Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning

Reward-free World Models for Online Imitation Learning

The Effectiveness of World Models for Continual Reinforcement Learning

STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning

Efficient Exploration and Discriminative World Model Learning with an Object-Centric Abstraction

Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

Predictive World Models from Real-World Partial Observations

Learning a World Model With Multitimescale Memory Augmentation

PWM: Policy Learning with Large World Models

Deep Neuroevolution of Recurrent and Discrete World Models

Open-World Reinforcement Learning over Long Short-Term Imagination

Continual Learning Using World Models for Pseudo-Rehearsal

Efficient Imitation Learning with Conservative World Models