Abstract: We can visually discriminate and recognize a wide range of materials. Meanwhile, we use language to express our subjective understanding of visual input and to communicate relevant information about the materials. Here, we investigate the relationship between visual judgment and language expression in material perception to understand how visual features relate to semantic representations. We use deep generative networks to construct an expandable image space in which we systematically create materials of both well-defined and ambiguous categories. From this space, we sample diverse stimuli and compare the representations of materials obtained from two behavioral tasks: visual material similarity judgments and free-form verbal descriptions. Our findings reveal a moderate but significant correlation between vision and language at the categorical level. However, analyzing the representations with an unsupervised alignment method, we discover structural differences that arise at the image-to-image level, especially among materials morphed between known categories. Moreover, visual judgments exhibit more individual differences than verbal descriptions. Our results show that while verbal descriptions capture material qualities at a coarse level, they may not fully convey the visual features that characterize a material's optical properties. Analyzing the image representations of materials obtained from various pre-trained, data-rich deep neural networks, we find that the similarity structures of human visual judgments align more closely with those of a text-guided visual-semantic model than with those of purely vision-based models. Our findings suggest that while semantic representations facilitate material categorization, non-semantic visual features also play a significant role in discriminating materials at a finer level. This work illustrates the need to consider the vision-language relationship in building a comprehensive model of material perception. Moreover, we propose a novel framework for quantitatively evaluating the alignment and misalignment between representations from different modalities, leveraging information from human behaviors and computational models.
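
To make the idea of comparing similarity structures across modalities concrete, the following is a minimal sketch of one common way to correlate two representational dissimilarity matrices (RDMs). It is not the authors' pipeline: the helper names (`upper_triangle`, `average_ranks`, `rdm_spearman`, `euclidean_rdm`), the random toy embeddings, and the choice of Spearman rank correlation over Euclidean distances are all assumptions made for illustration. The unsupervised alignment method mentioned in the abstract is a separate analysis and is not shown here.

```python
import numpy as np


def upper_triangle(rdm):
    """Return the off-diagonal upper-triangle entries of a square matrix."""
    rdm = np.asarray(rdm, dtype=float)
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def rdm_spearman(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    a = average_ranks(upper_triangle(rdm_a))
    b = average_ranks(upper_triangle(rdm_b))
    return float(np.corrcoef(a, b)[0, 1])


def euclidean_rdm(embeddings):
    """Pairwise Euclidean distances between the rows of an embedding matrix."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical toy data: 6 stimuli with 4-dimensional embeddings per modality.
# Real data would come from behavioral judgments or model features.
rng = np.random.default_rng(0)
vision_embed = rng.normal(size=(6, 4))
language_embed = vision_embed + 0.5 * rng.normal(size=(6, 4))

vision_rdm = euclidean_rdm(vision_embed)
language_rdm = euclidean_rdm(language_embed)

rho = rdm_spearman(vision_rdm, language_rdm)
print(f"Spearman correlation between toy RDMs: {rho:.3f}")
```

Only the upper triangle is used because an RDM is symmetric with a zero diagonal, so including the remaining entries would double-count pairs and inflate the correlation.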

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case

Do Multi-Sense Embeddings Improve Natural Language Understanding?

Probing Cross-modal Semantics Alignment Capability from the Textual Perspective

Probing Multilingual Sentence Representations With X-Probe

Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models

Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?

Investigating semantic subspaces of Transformer sentence embeddings through linear structural probing

Probing Multimodal Large Language Models for Global and Local Semantic Representations

Idioms, Probing and Dangerous Things: Towards Structural Probing for Idiomaticity in Vector Space

Probing Contextualized Sentence Representations with Visual Awareness

Probing Conceptual Understanding of Large Visual-Language Models

How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation

Learning semantic sentence representations from visually grounded language without lexical knowledge

Probing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders

What is the limitation of multimodal LLMs? A deeper look into multimodal LLMs through prompt probing

Probing the Role of Positional Information in Vision-Language Models

Probing Pretrained Language Models for Lexical Semantics

Probing the Link Between Vision and Language in Material Perception Using Psychophysics and Unsupervised Learning

A Latent-Variable Model for Intrinsic Probing

Language Models As Zero-shot Visual Semantic Learners

ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map