Abstract: From an early age, humans must make sense of rich environments full of socially and physically grounded concepts. For example, when watching a rapidly unfolding tennis match, we anticipate ball trajectories from the players' body cues and goals. When following a long storyline, we track the mental states of characters who have varying knowledge of an unfolding conflict. This learning problem is complex: it can be multimodal, requires integrating information over varying timescales, and involves implicitly co-attending to social and physical scene properties for downstream reasoning. Large vision-language models such as GPT-4V and LLaMA-3, which use vision-language embeddings, show skills in commonsense psychology and physics, though they only process single images. Models like CLIP and VisualBERT capture visual information represented in high-level cortical areas but do not inherently capture video-level representations. This paper introduces a novel video-language architecture that incorporates pooled video embeddings into LLMs by first extracting spatiotemporal embeddings and then mapping them to the model decoder through a learnable linear layer. We further train the model on video-caption pairs from the ADEPT and AGENT datasets so that it can judge surprisal in physical and psychological contexts using natural language. Finally, we design separate voxel-wise encoding models for videos involving physics and psychology, using the hidden states and logits from the LLM's last layer and pre-projected CLIP embeddings. We find that hidden state activations explain a large share of variance (up to ~70%) across dorsal physics regions and highly distributed ventral social vision areas. Notably, for encoding models trained only on physically surprising stimuli, the hidden states and pre-projected CLIP embeddings explain variance in nearly identical regions across the inferior parietal lobule. However, when the encoding model is trained only on socially surprising events, hidden states explain far more distributed ventral and dorsal activations than pre-projected CLIP embeddings do.
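The projection step described in the abstract (pooling spatiotemporal embeddings and mapping them to the decoder through a linear layer) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: mean pooling over frames, the function name `pool_and_project`, and all dimensions are assumptions chosen for illustration, and the weights here are random rather than learned.

```python
import numpy as np


def pool_and_project(frame_embeddings, weight, bias):
    """Mean-pool per-frame embeddings over time, then apply a linear map.

    frame_embeddings: (num_frames, video_dim) array of video features.
    weight: (video_dim, llm_dim) projection matrix (learned in practice).
    bias: (llm_dim,) offset vector.
    """
    # Assumption: pooling is a simple mean over the frame axis.
    pooled = frame_embeddings.mean(axis=0)  # shape (video_dim,)
    # Linear layer that maps the pooled vector into the decoder's input space.
    return pooled @ weight + bias  # shape (llm_dim,)


# Hypothetical sizes; the abstract does not state the real dimensions.
num_frames, video_dim, llm_dim = 16, 512, 4096

rng = np.random.default_rng(0)
frames = rng.standard_normal((num_frames, video_dim))
weight = rng.standard_normal((video_dim, llm_dim)) * 0.02
bias = np.zeros(llm_dim)

video_token = pool_and_project(frames, weight, bias)
print(video_token.shape)
```

In a trained model the weight and bias would be optimized on the video-caption pairs, and the resulting vector would be supplied to the decoder alongside the text token embeddings.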
