Abstract: One of the predominant methods for training world models is autoregressive prediction, in the output space, of the next element of a sequence. In Natural Language Processing (NLP), this takes the form of Large Language Models (LLMs) predicting the next token; in Computer Vision (CV), it takes the form of autoregressive models predicting the next frame, token, or pixel. However, this approach differs from human cognition in three respects. First, human predictions about the future actively influence internal cognitive processes. Second, humans naturally evaluate the plausibility of predictions about future states. Third, building on this capability, humans judge when a prediction is sufficient and therefore allocate a dynamic amount of time to making it. This adaptive process is analogous to System 2 thinking in psychology. All of these capabilities are fundamental to human success at high-level reasoning and planning. To address the limitations of traditional autoregressive models, which lack these human-like capabilities, we introduce Energy-Based World Models (EBWM). EBWM involves training an Energy-Based Model (EBM) to predict the compatibility of a given context and a predicted future state. In doing so, EBWM enables models to achieve all three facets of human cognition described above. Moreover, we developed a variant of the traditional autoregressive transformer tailored for energy-based models, termed the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales better with data and GPU hours than traditional autoregressive transformers in CV, and that EBWM offers promising early scaling in NLP. Consequently, this approach offers an exciting path toward training future models capable of System 2 thinking and of intelligently searching across state spaces.
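The idea of scoring the compatibility of a context and a candidate future state can be illustrated with a minimal sketch. This is not the EBT architecture or the authors' training procedure: the hand-written quadratic `energy` below is a hypothetical stand-in for a learned network, and the names `energy`, `energy_grad`, `refine`, `step_size`, `threshold`, and `max_steps` are assumptions made for illustration. The sketch shows a candidate prediction being refined by gradient descent on the energy, stopping once the energy is low enough, so that a poor initial guess consumes more refinement steps than a good one.

```python
from typing import List, Tuple


def energy(context: List[float], candidate: float) -> float:
    """Toy energy: low when the candidate is compatible with the context.

    Hypothetical stand-in for a learned model: the most compatible next
    state is assumed to continue the last observed change in the context.
    """
    target = 2.0 * context[-1] - context[-2]
    return (candidate - target) ** 2


def energy_grad(context: List[float], candidate: float) -> float:
    """Analytic gradient of the toy energy with respect to the candidate."""
    target = 2.0 * context[-1] - context[-2]
    return 2.0 * (candidate - target)


def refine(
    context: List[float],
    guess: float,
    step_size: float = 0.25,
    threshold: float = 1e-4,
    max_steps: int = 50,
) -> Tuple[float, float, int]:
    """Refine a candidate by descending the energy until it is plausible.

    Returns the refined candidate, its final energy, and the number of
    refinement steps that were used.
    """
    steps = 0
    current_energy = energy(context, guess)
    # Stop as soon as the prediction is judged plausible (low energy),
    # or when the step budget is exhausted.
    while current_energy > threshold and steps < max_steps:
        guess -= step_size * energy_grad(context, guess)
        current_energy = energy(context, guess)
        steps += 1
    return guess, current_energy, steps


if __name__ == "__main__":
    context = [1.0, 2.0, 3.0]
    for initial_guess in (3.0, 10.0):
        prediction, final_energy, used = refine(context, initial_guess)
        print(f"start={initial_guess} prediction={prediction:.4f} "
              f"energy={final_energy:.2e} steps={used}")
```

In this toy setting the energy plays both roles described in the abstract: it scores how plausible a candidate is, and its value decides when to stop refining, which is what makes the amount of computation per prediction variable.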

Facing Off World Model Backbones: RNNs, Transformers, and S4

Learning a World Model With Multitimescale Memory Augmentation

Mastering Memory Tasks with World Models

TransDreamer: Reinforcement Learning with Transformer World Models

Augmenting Replay in World Models for Continual Reinforcement Learning

Learning Latent Dynamic Robust Representations for World Models

Locality Sensitive Sparse Encoding for Learning World Models Online

STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning

Harmony World Models: Boosting Sample Efficiency for Model-based Reinforcement Learning

One-shot World Models Using a Transformer Trained on a Synthetic Prior

The Effectiveness of World Models for Continual Reinforcement Learning

A Biologically-Inspired Dual Stream World Model

How the Brain Formulates Memory: A Spatio-Temporal Model Research Frontier

Structured State Space Models for In-Context Reinforcement Learning

Evaluating World Models with LLM for Decision Making

Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

Slot Structured World Models

Language Models Meet World Models: Embodied Experiences Enhance Language Models

Improving Token-Based World Models with Parallel Observation Prediction

Cognitively Inspired Energy-Based World Models