Abstract: From an early age, humans are challenged with evaluating rich environments full of socially and physically grounded concepts. For example, we might watch a rapidly unfolding tennis match, anticipating ball trajectories based on the body cues and goals of the players. In another scenario, we may follow long storylines, tracking the mental states of characters with varying knowledge of an unfolding conflict. This learning problem is notably complex: it can be multimodal, integrate information at varying timescales, and implicitly co-attend to social and physical scene properties for downstream reasoning. Large language-vision models like GPT4-V and LLaMA-3, which use vision-language embeddings, show skills in commonsense psychology and physics, though they only process single images. Models like CLIP and VisualBERT encode visual information in high-level cortical areas but do not inherently capture video-level representations. This paper introduces a novel video-language architecture that incorporates pooled video embeddings into LLMs by first extracting spatiotemporal embeddings and then mapping them to the model decoder through a learnable linear layer. We enhance the model by training it on video-caption pairs from the ADEPT and AGENT datasets, with the aim of judging surprisal in physical and psychological contexts using natural language. Finally, we design separate voxel-wise encoding models for videos involving physics and psychology, using the hidden states and logits from the LLM's last layer and pre-projected CLIP embeddings. We find that hidden state activations explain a remarkably high share of variance (up to ~70%) across dorsal physics regions and highly distributed ventral social vision areas. Notably, for encoding models trained only on physically surprising stimuli, the hidden states and pre-projected CLIP embeddings explain variance in nearly identical regions across the inferior parietal lobule. However, when the encoding model is trained only on socially surprising events, hidden states explain far more distributed ventral and dorsal activations than pre-projected CLIP embeddings.
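
The pooling-and-projection step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not say which pooling operator is used, so mean pooling over frames is an assumption here, and the names `mean_pool`, `linear_project`, `W`, and `b` are hypothetical. In a real system the weights would be learned and the operations would run on tensors in a deep learning framework.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mean_pool(frame_embeddings: Matrix) -> Vector:
    """Average per-frame embeddings over time into one video-level vector.

    Assumption: mean pooling; the abstract only says "pooled".
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    n_frames = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n_frames
            for i in range(dim)]


def linear_project(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Apply y = W x + b, where W has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b_j
            for row, b_j in zip(weight, bias)]


# Toy example: 3 frames with 2-dim visual features, mapped to a 3-dim decoder input.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical weights; learned in practice
b = [0.0, 0.0, 0.5]                       # hypothetical bias

video_vec = mean_pool(frames)                    # [3.0, 4.0]
decoder_input = linear_project(video_vec, W, b)  # [3.0, 4.0, 7.5]
```

The projected vector plays the role of the video representation handed to the language model's decoder. Only the linear map is described as learnable in the abstract, which is why the sketch keeps pooling parameter-free.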

The Power of Next-Frame Prediction for Learning Physical Laws

Towards an Interpretable Latent Space in Structured Models for Video Prediction

Learning Physical Dynamics for Object-centric Visual Prediction

How Far is Video Generation from World Model: A Physical Law Perspective

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Predicting Long-horizon Futures by Conditioning on Geometry and Time

On the difficulty of learning and predicting the long-term dynamics of bouncing objects

Visual Physics: Discovering Physical Laws from Videos

Physics in Next-token Prediction

3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes

Next frame prediction using ConvLSTM

Unsupervised learning for physical interaction through video prediction

Video-Language Models as Flexible Social and Physical Reasoners

PastNet: Introducing Physical Inductive Biases for Spatio-temporal Video Prediction

Transformation-based models of video sequences

Learning to Predict 3D Rotational Dynamics from Images of a Rigid Body with Unknown Mass Distribution

Exploring and Exploiting High-Order Spatial-Temporal Dynamics for Long-Term Frame Prediction

Predicting the Physical Dynamics of Unseen 3D Objects

Predicting Physics in Mesh-reduced Space with Temporal Attention

What happens next and when "next" happens: Mechanisms of spatial and temporal prediction

Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems