Abstract: Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, plausible future states, and use that to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts to directly impinge on this question. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically-relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-centric objectives, to models that future predict in the latent space of purely static image-based or dynamic video-based pretrained foundation models. We find strong differentiation across these model classes in their ability to predict neural and behavioral data both within and across diverse environments. In particular, we find that neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models optimized for dynamic scenes in a self-supervised manner. Notably, models that future predict in the latent space of video foundation models that are optimized to support a diverse range of sensorimotor tasks reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation are thus far most consistent with being optimized to future predict on dynamic, reusable visual representations that are useful for Embodied AI more generally.

What problem does this paper attempt to address?

The paper asks how the brain carries out mental simulation: how it predicts the future states of objects and events in dynamic scenes, and how it uses those predictions to plan and to anticipate the consequences of actions. The researchers address this by combining goal-driven modeling with dense neurophysiological data and high-throughput human behavioral readouts. They construct and evaluate several classes of sensory-cognitive networks that predict future states in rich, ethologically relevant environments. These range from self-supervised end-to-end models (with pixel-wise or object-centric objectives) to models that predict the future in the latent space of pretrained foundation models, trained either on static images or on dynamic video.

The main findings of the paper include:

1. **Differences in model performance**: The study found that "size is not everything": many state-of-the-art machine-learning models perform poorly on the neural and behavioral benchmarks. Only one class of models generally matches the data well, namely models that predict future states in the latent space of pretrained foundation models optimized for dynamic scenes.
2. **Neural response prediction**: These models not only predict neural responses best, they also come close to matching how well the neurons themselves predict visually hidden environmental state variables, even though the models were not explicitly trained to do so.
3. **Foundation-model latent spaces are not equivalent**: Not all foundation-model latent spaces perform equally. In particular, models that predict the future in the latent space of video foundation models optimized to support diverse egocentric sensorimotor tasks reasonably match both human behavioral error patterns and neural dynamics across all tested environmental scenarios.

In summary, the study indicates that the neural mechanisms and behaviors of primate mental simulation have strong inductive biases. So far, they are most consistent with being optimized for future prediction on dynamic, reusable visual representations that are useful for embodied AI more generally.
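To make "future prediction in a latent space" concrete, here is a minimal sketch that is not the paper's implementation. It assumes a frozen encoder, passed in as the hypothetical callable `encode`, that maps each frame to a latent vector. It also swaps the paper's learned dynamics modules for a plain linear map fit by ridge regression. All function names are illustrative.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, z: Vector) -> Vector:
    """Return A @ z for a small dense matrix stored as nested lists."""
    return [sum(a * x for a, x in zip(row, z)) for row in A]


def solve(M: Matrix, B: Matrix) -> Matrix:
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_latent_dynamics(latents: List[Vector], ridge: float = 1e-9) -> Matrix:
    """Ridge-regression fit of A such that z[t+1] ~ A @ z[t].

    Assumption: linear dynamics stand in for a learned dynamics module.
    """
    d = len(latents[0])
    X, Y = latents[:-1], latents[1:]
    # Normal equations: (X^T X + ridge * I) A^T = X^T Y
    XtX = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    XtY = [[sum(x[i] * y[j] for x, y in zip(X, Y))
            for j in range(d)] for i in range(d)]
    At = solve(XtX, XtY)
    return [[At[j][i] for j in range(d)] for i in range(d)]  # transpose


def rollout(encode: Callable[[Vector], Vector], frame: Vector,
            A: Matrix, steps: int) -> List[Vector]:
    """Encode one frame with the frozen encoder, then predict forward
    purely in latent space; pixels are never reconstructed."""
    z = encode(frame)
    trajectory = [z]
    for _ in range(steps):
        z = matvec(A, z)
        trajectory.append(z)
    return trajectory
```

The point of the design is that only the dynamics map `A` is fit while the encoder stays fixed, so the quality of the predictions depends heavily on which latent space the encoder provides. This mirrors the comparison between static-image and dynamic-video foundation models described above.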

Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes

Neural Representations of Dynamic Visual Stimuli

Neural Dynamics Model of Visual Decision-Making: Learning from Human Experts

Modeling Dynamic Environments with Scene Graph Memory

Modeling dynamic neural activity by combining naturalistic video stimuli and stimulus-independent latent factors

Goal-Directed Behavior under Variational Predictive Coding: Dynamic Organization of Visual Attention and Working Memory

Foundation model of neural activity predicts response to new stimulus types and anatomy

Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity

Using Features at Multiple Temporal and Spatial Resolutions to Predict Human Behavior in Real Time

Deep Reinforcement Learning Models Predict Visual Responses in the Brain: A Preliminary Result

Peering into the future: Eye movements predict neural repetition effects during episodic simulation

Probabilistic Future Prediction for Video Scene Understanding

FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations

Deep Predictive Learning in Neocortex and Pulvinar

Modeling human activity comprehension at human scale: prediction, segmentation, and categorization

Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans

Predictive Encoding of Contextual Relationships for Perceptual Inference, Interpolation and Prediction.

Discovering Latent States for Model Learning: Applying Sensorimotor Contingencies Theory and Predictive Processing to Model Context

Modelling Human Visual Motion Processing with Trainable Motion Energy Sensing and a Self-attention Network

Learning 3D object-centric representation through prediction

Learning Physical Dynamics for Object-centric Visual Prediction