Abstract: From an early age, humans must evaluate rich environments full of socially and physically grounded concepts. For example, we might watch a rapidly unfolding tennis match, anticipating ball trajectories from the body cues and goals of the players. In another scenario, we may follow long storylines, keeping track of the mental states of characters who have varying knowledge of an unfolding conflict. This learning problem is notably complex: it can be multimodal, it integrates information at varying timescales, and it implicitly co-attends to social and physical scene properties for downstream reasoning. Large language-vision models like GPT4-V and LLaMA-3, which use vision-language embeddings, show skills in commonsense psychology and physics, though they only process single images. Models like CLIP and VisualBERT encode visual information in high-level cortical areas but do not inherently capture video-level representations. This paper introduces a novel video-language architecture that incorporates pooled video embeddings into LLMs by first extracting spatiotemporal embeddings and then mapping them to the model decoder through a learnable linear layer. We enhance the model by training it on video-caption pairs from the ADEPT and AGENT datasets, with the aim of judging surprisal in physical and psychological contexts using natural language. Finally, we design separate voxel-wise encoding models for videos involving physics and psychology, using the hidden states and logits from the LLM's last layer as well as pre-projected CLIP embeddings. We find that hidden state activations explain a high proportion of variance (up to ~70%) across dorsal physics regions and highly distributed ventral social vision areas. Notably, for encoding models trained only on physically surprising stimuli, the hidden states and pre-projected CLIP embeddings explain variance in nearly identical regions across the inferior parietal lobule. However, when the encoding model is trained only on socially surprising events, hidden states explain far more distributed ventral and dorsal activations than pre-projected CLIP embeddings do.
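
The step of pooling video embeddings and mapping them to the decoder through a linear layer can be illustrated with a minimal, dependency-free sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the use of mean pooling over frames, the names `mean_pool` and `LinearProjection`, the random initialization, and the toy dimensions. In a real model the projection weights would be learned during training, and the inputs would come from a spatiotemporal video encoder.

```python
import random


def mean_pool(frame_embeddings):
    """Average per-frame embeddings (T x D) into one D-dimensional video embedding.

    Mean pooling is an assumption; the abstract only says the embeddings are pooled.
    """
    num_frames = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [
        sum(frame[d] for frame in frame_embeddings) / num_frames
        for d in range(dim)
    ]


class LinearProjection:
    """Hypothetical linear map from video-embedding space to decoder-embedding space.

    The weights are randomly initialized here; in practice they would be trained.
    """

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / in_dim ** 0.5
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        # Computes W x + b, one output coordinate per weight row.
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


# Toy sizes (assumed): 8 frames, 16-dim video features, 32-dim decoder embeddings.
frames = [[float(t + d) for d in range(16)] for t in range(8)]
video_embedding = mean_pool(frames)
projector = LinearProjection(in_dim=16, out_dim=32)
video_token = projector(video_embedding)
```

The resulting `video_token` has the decoder's embedding width, so under these assumptions it could be placed alongside the text token embeddings that the language model consumes.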

Like a Baby: Visually Situated Neural Language Acquisition

A computational model of early language acquisition from audiovisual experiences of young infants

Baby's CoThought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models

Developmental Predictive Coding Model for Early Infancy Mono and Bilingual Vocal Continual Learning

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

A Neural Network Model of Lexical Competition during Infant Spoken Word Recognition

Neural Language Modeling with Visual Features

Language Model-Based Paired Variational Autoencoders for Robotic Language Learning

Acquiring Linguistic Knowledge from Multimodal Input

Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network

Mind the Context - The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions

Video-Language Models as Flexible Social and Physical Reasoners

Understanding Early Word Learning in Situated Artificial Agents

On Architectures for Including Visual Information in Neural Language Models for Image Description

Look Before you Speak: Visually Contextualized Utterances

Visually grounded learning of keyword prediction from untranscribed speech

Visual representations in the human brain are aligned with large language models

Can training neural language models on a curriculum with developmentally plausible data improve alignment with human reading behavior?

BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM

Embodied Language Grounding with 3D Visual Feature Representations

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web