Abstract: Children typically learn the meanings of nouns earlier than the meanings of verbs. However, it is unclear whether this asymmetry is a result of complexity in the visual structure of categories in the world to which language refers, the structure of language itself, or the interplay between the two sources of information. We quantitatively test these three hypotheses regarding early verb learning by employing visual and linguistic representations of words sourced from large-scale pre-trained artificial neural networks. Examining the structure of both visual and linguistic embedding spaces, we find, first, that the representation of verbs is generally more variable and less discriminable within domain than the representation of nouns. Second, we find that if only one learning instance per category is available, visual and linguistic representations are less well aligned in the verb system than in the noun system. However, in parallel with the course of human language development, if multiple learning instances per category are available, visual and linguistic representations become almost as well aligned in the verb system as in the noun system. Third, we compare the relative contributions of factors that may predict learning difficulty for individual words. A regression analysis reveals that visual variability is the strongest factor that internally drives verb learning, followed by visual-linguistic alignment and linguistic variability. Based on these results, we conclude that verb acquisition is influenced by all three sources of complexity, but that the variability of visual structure poses the most significant challenge for verb learning.

Quantifying the Roles of Visual, Linguistic, and Visual-Linguistic Complexity in Verb Acquisition