Abstract: We can visually discriminate and recognize a wide range of materials. Meanwhile, we use language to express our subjective understanding of visual input and to communicate relevant information about the materials. Here, we investigate the relationship between visual judgment and language expression in material perception to understand how visual features relate to semantic representations. We use deep generative networks to construct an expandable image space in which we systematically create materials of both well-defined and ambiguous categories. From this space, we sample diverse stimuli and compare the representations of materials obtained from two behavioral tasks: visual material similarity judgments and free-form verbal descriptions. Our findings reveal a moderate but significant correlation between vision and language at the categorical level. However, analyzing the representations with an unsupervised alignment method, we discover structural differences that arise at the image-to-image level, especially among materials morphed between known categories. Moreover, visual judgments exhibit more individual differences than verbal descriptions. Our results show that while verbal descriptions capture material qualities at a coarse level, they may not fully convey the visual features that characterize the material's optical properties. Analyzing the image representations of materials obtained from various pre-trained, data-rich deep neural networks, we find that the similarity structures of human visual judgments align more closely with those of the text-guided visual-semantic model than with those of purely vision-based models. Our findings suggest that while semantic representations facilitate material categorization, non-semantic visual features also play a significant role in discriminating materials at a finer level. This work illustrates the need to consider the vision-language relationship in building a comprehensive model of material perception. Moreover, we propose a novel framework for quantitatively evaluating the alignment and misalignment between representations from different modalities, leveraging information from human behaviors and computational models.
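As an illustration of what comparing similarity structures across modalities can look like, the sketch below rank-correlates the pairwise dissimilarities of two representations of the same stimuli. It is only a minimal example under stated assumptions: the abstract does not specify the correlation measure or the unsupervised alignment method, and the function names (`pairwise_distances`, `upper_triangle`, `rdm_correlation`) and the synthetic feature matrices are hypothetical, not taken from the paper.

```python
import numpy as np


def pairwise_distances(features):
    """Euclidean distance between every pair of rows (one row per stimulus)."""
    features = np.asarray(features, dtype=float)
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def upper_triangle(rdm):
    """Off-diagonal upper-triangle entries of a square dissimilarity matrix."""
    rdm = np.asarray(rdm, dtype=float)
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]


def ranks(values):
    """Ordinal ranks; ties are broken by position, which is adequate for
    continuous-valued dissimilarities (an assumption of this sketch)."""
    order = np.argsort(values, kind="stable")
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result


def rdm_correlation(rdm_a, rdm_b):
    """Spearman-style rank correlation between two dissimilarity matrices."""
    a = ranks(upper_triangle(rdm_a))
    b = ranks(upper_triangle(rdm_b))
    return float(np.corrcoef(a, b)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for vision-derived and language-derived embeddings
    # of the same 20 stimuli; real data would come from the behavioral tasks.
    vision = rng.normal(size=(20, 5))
    language = vision + rng.normal(scale=1.0, size=(20, 5))
    score = rdm_correlation(pairwise_distances(vision),
                            pairwise_distances(language))
    print(f"rank correlation between similarity structures: {score:.3f}")
```

A correlation of this kind summarizes agreement at a global level only; it would not by itself reveal the image-to-image structural differences the abstract attributes to the unsupervised alignment analysis.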

Probing the Link Between Vision and Language in Material Perception Using Psychophysics and Unsupervised Learning