UniZero: Generalized and Efficient Planning with Scalable Latent World Models

Yuan Pu, Yazhe Niu, Jiyuan Ren, Zhenjie Yang, Hongsheng Li, Yu Liu

2024-06-15

Abstract: Learning predictive world models is essential for enhancing the planning capabilities of reinforcement learning agents. Notably, the MuZero-style algorithms, based on the value equivalence principle and Monte Carlo Tree Search (MCTS), have achieved superhuman performance in various domains. However, in environments that require capturing long-term dependencies, MuZero's performance deteriorates rapidly. We identify that this is partially due to the entanglement of latent representations with historical information, which results in incompatibility with the auxiliary self-supervised state regularization. To overcome this limitation, we present UniZero, a novel approach that disentangles latent states from implicit latent history using a transformer-based latent world model. By concurrently predicting latent dynamics and decision-oriented quantities conditioned on the learned latent history, UniZero enables joint optimization of the long-horizon world model and policy, facilitating broader and more efficient planning in latent space. We demonstrate that UniZero, even with single-frame inputs, matches or surpasses the performance of MuZero-style algorithms on the Atari 100k benchmark. Furthermore, it significantly outperforms prior baselines in benchmarks that require long-term memory. Lastly, we validate the effectiveness and scalability of our design choices through extensive ablation studies, visual analyses, and multi-task learning results. The code is available at https://github.com/opendilab/LightZero.

Machine Learning

What problem does this paper attempt to address?

This paper addresses the performance degradation of MuZero-style algorithms on tasks with long-term dependencies: in environments that require capturing such dependencies, MuZero's performance deteriorates rapidly. The researchers attribute this partly to the entanglement of latent representations with historical information, which makes those representations incompatible with auxiliary self-supervised state regularization. In addition, MuZero underutilizes trajectory data during training, which further limits its performance and efficiency.

To overcome these limitations, the paper proposes UniZero, a method built on a Transformer-based latent world model. UniZero separates the latent state from the implicit latent history. It concurrently predicts latent dynamics and decision-oriented quantities (such as policy and value) conditioned on the learned latent history, so that the long-horizon world model and the policy are optimized jointly. This improves long-horizon planning while still performing well on tasks that require only short-term memory.

On the Atari 100k benchmark, UniZero matches or outperforms MuZero-style algorithms even with single-frame input. On benchmarks that require long-term memory, it significantly outperforms previous baseline methods. These results support the effectiveness and scalability of UniZero on tasks with long-term dependencies.
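The idea of keeping each latent state free of history while conditioning predictions on that history can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`encode`, `embed_action`, `attend`, `predict`), the interleaved token layout, and the fixed, untrained read-outs are all assumptions made for illustration, and a single attention step stands in for a full transformer.

```python
import math

def encode(obs):
    """Hypothetical encoder: maps ONE observation to a latent, with no access to history."""
    return [math.tanh(x) for x in obs]

def embed_action(action, dim):
    """Hypothetical one-hot action token with the same width as a latent."""
    return [1.0 if i == action % dim else 0.0 for i in range(dim)]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query over a list of tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def predict(observations, actions):
    """Predict quantities for the last step, conditioned on the whole token history.

    Tokens are interleaved as z_1, a_1, z_2, a_2, ... (an assumed layout). Only the
    final token is used as the query, so it sees itself and everything before it,
    which is exactly what a causal mask would allow at that position.
    """
    dim = len(observations[0])
    tokens = []
    for obs, act in zip(observations, actions):
        tokens.append(encode(obs))              # history-free latent state
        tokens.append(embed_action(act, dim))   # action token
    context = attend(tokens[-1], tokens, tokens)
    # Toy "heads": fixed, untrained read-outs standing in for learned ones.
    next_latent = [math.tanh(c) for c in context]
    value = sum(context) / dim
    return next_latent, value

# Same final observation and action, two different histories.
_, value_a = predict([[1.0, 0.0], [0.5, -0.2]], [0, 1])
_, value_b = predict([[-1.0, 0.3], [0.5, -0.2]], [1, 1])
print(encode([0.5, -0.2]), value_a, value_b)
```

In this sketch the latent for a given observation is identical in both runs, because `encode` never sees the past, while the predicted value differs, because the attention step reads the earlier tokens. A recurrent design that folds history into the latent itself would not keep those two roles apart.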
