UniZero: Generalized and Efficient Planning with Scalable Latent World Models

Yuan Pu, Yazhe Niu, Jiyuan Ren, Zhenjie Yang, Hongsheng Li, Yu Liu
2024-06-15
Abstract: Learning predictive world models is essential for enhancing the planning capabilities of reinforcement learning agents. Notably, the MuZero-style algorithms, based on the value equivalence principle and Monte Carlo Tree Search (MCTS), have achieved superhuman performance in various domains. However, in environments that require capturing long-term dependencies, MuZero's performance deteriorates rapidly. We identify that this is partially due to the entanglement of latent representations with historical information, which results in incompatibility with the auxiliary self-supervised state regularization. To overcome this limitation, we present UniZero, a novel approach that disentangles latent states from implicit latent history using a transformer-based latent world model. By concurrently predicting latent dynamics and decision-oriented quantities conditioned on the learned latent history, UniZero enables joint optimization of the long-horizon world model and policy, facilitating broader and more efficient planning in latent space. We demonstrate that UniZero, even with single-frame inputs, matches or surpasses the performance of MuZero-style algorithms on the Atari 100k benchmark. Furthermore, it significantly outperforms prior baselines in benchmarks that require long-term memory. Lastly, we validate the effectiveness and scalability of our design choices through extensive ablation studies, visual analyses, and multi-task learning results. The code is available at https://github.com/opendilab/LightZero.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the performance degradation of MuZero-style algorithms in tasks with long-term dependencies: in environments that require capturing such dependencies, MuZero's performance deteriorates rapidly. The authors attribute this partly to the entanglement of latent representations with historical information, which makes those representations incompatible with auxiliary self-supervised state regularization. In addition, MuZero underutilizes trajectory data during training, which further limits its performance and efficiency.

To overcome these limitations, the paper proposes UniZero, a method built on a Transformer-based latent world model. UniZero separates the latent state from the implicit latent history, then predicts latent dynamics and decision-oriented quantities (such as policy and value) together, conditioned on the learned latent history. This allows the long-horizon world model and the policy to be optimized jointly and supports more efficient planning in latent space. On the Atari 100k benchmark, UniZero matches or surpasses MuZero-style algorithms even with single-frame input, and on benchmarks that require long-term memory it significantly outperforms prior baselines. The authors further support the effectiveness and scalability of the design with ablation studies, visual analyses, and multi-task learning results.
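The separation between per-step latent states and a learned latent history can be illustrated with a toy sketch. The code below is not UniZero's implementation (see the LightZero repository for that); it is a minimal, standard-library-only illustration under simplifying assumptions: a single attention head with identity projections, random weights in place of trained ones, and one token per step formed by concatenating the latent state with a one-hot action. All function and parameter names (`encode`, `causal_attention`, `rollout`, `init_params`) are hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(x, w):
    """Matrix-vector product: one output per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def encode(obs, w_enc):
    """Encode a single observation into a latent state; no history is passed in."""
    return [math.tanh(v) for v in linear(obs, w_enc)]


def causal_attention(tokens):
    """Single-head causal self-attention with identity projections (toy)."""
    scale = math.sqrt(len(tokens[0]))
    contexts = []
    for t, query in enumerate(tokens):
        visible = tokens[: t + 1]  # step t sees only steps 0..t
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in visible]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        contexts.append([
            sum(w / total * tok[i] for w, tok in zip(weights, visible))
            for i in range(len(query))
        ])
    return contexts


def init_params(obs_dim, latent_dim, num_actions, seed=0):
    """Hypothetical parameter container: encoder plus three prediction heads."""
    rng = random.Random(seed)
    token_dim = latent_dim + num_actions
    return {
        "enc": make_matrix(latent_dim, obs_dim, rng),
        "dyn": make_matrix(latent_dim, token_dim, rng),
        "pi": make_matrix(num_actions, token_dim, rng),
        "val": make_matrix(1, token_dim, rng),
    }


def rollout(observations, actions, params, num_actions):
    """Return per-step latents plus predictions conditioned on latent history."""
    # Each latent depends only on its own observation (disentangled from history).
    latents = [encode(o, params["enc"]) for o in observations]
    tokens = []
    for z, a in zip(latents, actions):
        one_hot = [1.0 if i == a else 0.0 for i in range(num_actions)]
        tokens.append(z + one_hot)
    # History enters only through attention over past tokens.
    contexts = causal_attention(tokens)
    predictions = []
    for h in contexts:
        predictions.append({
            "next_latent": linear(h, params["dyn"]),
            "policy_logits": linear(h, params["pi"]),
            "value": linear(h, params["val"])[0],
        })
    return latents, predictions
```

In this sketch, changing an early observation leaves later latent states untouched but changes later predictions, because history reaches the dynamics, policy, and value heads only through the attention context. A recurrent latent in the MuZero style would instead fold that history into the state itself.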