Abstract: World models constitute a promising approach for training reinforcement learning agents in a safe and sample-efficient manner. Recent world models predominantly operate on sequences of discrete latent variables to model environment dynamics. However, this compression into a compact discrete representation may ignore visual details that are important for reinforcement learning. Concurrently, diffusion models have become a dominant approach for image generation, challenging well-established methods modeling discrete latents. Motivated by this paradigm shift, we introduce DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained in a diffusion world model. We analyze the key design choices that are required to make diffusion suitable for world modeling, and demonstrate how improved visual details can lead to improved agent performance. DIAMOND achieves a mean human-normalized score of 1.46 on the competitive Atari 100k benchmark; a new best for agents trained entirely within a world model. We further demonstrate that DIAMOND's diffusion world model can stand alone as an interactive neural game engine by training on static Counter-Strike: Global Offensive gameplay. To foster future research on diffusion for world modeling, we release our code, agents, videos and playable world models at https://diamond-wm.github.io.
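For readers unfamiliar with the metric, the human-normalized score quoted in the abstract is conventionally computed per game as (agent - random) / (human - random) and then averaged over games. The following is a minimal sketch of that arithmetic; the function names and the per-game numbers are hypothetical placeholders, not results from the paper.

```python
def human_normalized_score(agent, random_baseline, human):
    """Score of 0.0 matches a random policy, 1.0 matches the human reference."""
    return (agent - random_baseline) / (human - random_baseline)


def mean_human_normalized_score(per_game):
    """Average the normalized score over (agent, random, human) triples."""
    scores = [human_normalized_score(a, r, h) for a, r, h in per_game]
    return sum(scores) / len(scores)


# Hypothetical raw scores for two made-up games (NOT from the paper).
example_games = [
    (30.0, 10.0, 50.0),    # halfway between random and human
    (250.0, 50.0, 150.0),  # twice the human margin over random
]
mean_hns = mean_human_normalized_score(example_games)
```

A mean above 1.0, such as the reported 1.46, therefore means the agent exceeds the human reference on average, although it can still fall below it on individual games.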

What problem does this paper attempt to address?

### The problems the paper attempts to solve

This paper addresses the low sample efficiency of reinforcement learning (RL), especially in environments where visual details are crucial for task performance. Existing world models usually compress information by modeling environment dynamics as a sequence of discrete latent variables, which can discard visual details. Such details matter for some tasks, for example recognizing traffic lights or detecting pedestrians in autonomous driving. The paper therefore proposes DIAMOND (DIffusion As a Model Of eNvironment Dreams), a method based on diffusion models that improves the world model's ability to generate high-quality visual details and thereby improves the performance of RL agents.

### Main contributions

1. **Introduction of DIAMOND**: DIAMOND is a reinforcement learning agent trained in a diffusion-based world model that generates high-quality visual details.
2. **Improvement of visual details**: By using a diffusion model, DIAMOND better captures and generates visual details, which is especially important for real-world applications such as autonomous driving.
3. **Strong benchmark performance**: DIAMOND achieves a mean human-normalized score of 1.46 on the Atari 100k benchmark, a new best for agents trained entirely within a world model.
4. **Interactive neural game engine**: DIAMOND's diffusion world model can also stand alone as an interactive neural game engine, demonstrating its potential in complex environments.

### Method overview

1. **Diffusion models**: The paper adopts a score-based diffusion model, which generates high-quality images by reversing a noising process.
2. **Conditional generation**: To serve as a world model, the diffusion model is conditioned on past observations and actions and generates the next observation.
3. **Training objective**: The paper designs a training objective that adaptively mixes signal and noise, enabling the model to predict clear images even at high noise levels.
4. **Sampling method**: The paper uses the Euler method for sampling, avoiding the additional computational cost of higher-order samplers.

### Experimental results

1. **Atari 100k benchmark**: On the 26 games of Atari 100k, DIAMOND shows significant performance improvements, especially in environments where small details must be captured, such as Asterix, Breakout and Road Runner.
2. **Comparison with existing methods**: DIAMOND outperforms other world model baselines, including STORM, DreamerV3 and IRIS, on multiple metrics.

### Conclusion

By introducing the diffusion-based DIAMOND, the paper addresses the difficulty existing world models have in generating high-quality visual details and improves the performance of RL agents. The method performs strongly on the Atari 100k benchmark, which suggests potential for real-world applications.
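The method steps summarized above (conditioning on past observations and actions, a denoiser that mixes signal and noise depending on the noise level, and Euler sampling) can be illustrated with a small sketch. It assumes EDM-style preconditioning in the sense of Karras et al. (2022) as one plausible reading of "adaptively mixes signal and noise". The `network` callable, the value `SIGMA_DATA = 0.5`, the noise schedule, and the use of flat lists of floats in place of image tensors are all illustrative assumptions, not the authors' implementation.

```python
import math
import random

SIGMA_DATA = 0.5  # assumption: data standard deviation used for scaling


def edm_coefficients(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in) for a given noise level sigma."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total               # weight on the noisy input
    c_out = sigma * sigma_data / math.sqrt(total)  # weight on the network output
    c_in = 1.0 / math.sqrt(total)                  # input scaling
    return c_skip, c_out, c_in


def denoise(network, noisy_obs, sigma, past_obs, past_actions):
    """Estimate the clean next observation from a noisy one.

    `network` is a hypothetical callable taking the scaled noisy
    observation, the noise level, and the conditioning history.
    """
    c_skip, c_out, c_in = edm_coefficients(sigma)
    scaled = [c_in * v for v in noisy_obs]
    raw = network(scaled, sigma, past_obs, past_actions)
    # Low noise: mostly keep the input. High noise: rely on the network.
    return [c_skip * x + c_out * r for x, r in zip(noisy_obs, raw)]


def euler_sample(network, past_obs, past_actions, size, sigmas, rng):
    """Generate one observation with first-order Euler steps.

    `sigmas` is a decreasing noise schedule whose last entry is 0.0.
    """
    x = [sigmas[0] * rng.gauss(0.0, 1.0) for _ in range(size)]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoise(network, x, sigma, past_obs, past_actions)
        # Euler step along d = (x - denoised) / sigma, one network call per step.
        x = [xi + (sigma_next - sigma) * (xi - di) / sigma
             for xi, di in zip(x, denoised)]
    return x
```

Each Euler step needs a single network evaluation, whereas a second-order sampler such as Heun's method needs roughly two; this is the cost trade-off the summary alludes to. In an imagination rollout, the sampled observation would be appended to `past_obs` and the procedure repeated with the agent's next action.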

Diffusion for World Modeling: Visual Details Matter in Atari
