Naming, Describing, and Quantifying Visual Objects in Humans and LLMs

Alberto Testoni, Juell Sprott, Sandro Pezzelle

2024-06-04

Abstract: While human speakers use a variety of different expressions when describing the same object in an image, giving rise to a distribution of plausible labels driven by pragmatic constraints, the extent to which current Vision & Language Large Language Models (VLLMs) can mimic this crucial feature of language use is an open question. This applies to common, everyday objects, but it is particularly interesting for uncommon or novel objects for which a category label may be lacking or fuzzy. Furthermore, similar patterns of variation are observed among human speakers for highly context-sensitive expressions, such as the quantifiers 'few' or 'most'. In our work, we evaluate VLLMs (FROMAGe, BLIP-2, LLaVA) on three categories (nouns, attributes, and quantifiers) where humans show great subjective variability concerning the distribution over plausible labels, using datasets and resources mostly under-explored in previous work. Our results reveal mixed evidence on the ability of VLLMs to capture human naming preferences at generation time: while some models are good at mimicking human distributions for nouns and attributes, all of them fail to assign quantifiers, a task that requires more accurate, high-level reasoning.

Computation and Language

What problem does this paper attempt to address?

The paper asks whether current Vision & Language Large Language Models (VLLMs) can reproduce the variability that human speakers show when labeling what they see in an image. Specifically, the study focuses on the following aspects:

1. **Naming common objects**: Whether the models describe common, everyday objects with a diversity of labels similar to that of humans.
2. **Naming novel objects**: Whether the models can name uncommon or novel objects, for which a category label may be lacking or fuzzy.
3. **Quantifiers**: How the models perform on a task that requires more accurate, high-level reasoning, namely choosing context-sensitive quantifiers such as 'few' or 'most' to describe the objects in an image.

The study finds mixed evidence. Some models mimic human patterns reasonably well when naming common objects and describing attributes such as color, but all of them perform poorly when choosing quantifiers. This points to limitations in the models' ability to estimate and compare quantities.
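To make "mimicking a human distribution over labels" concrete, the sketch below compares the labels several speakers gave for one object with the labels a model produced for the same object. It is only an illustration: the abstract does not say which metric the authors use, so the Jensen-Shannon divergence here is an assumed choice, and the function names and the toy labels are hypothetical.

```python
from collections import Counter
import math


def label_distribution(labels):
    """Turn a list of labels into a relative-frequency distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with no label in common.
    Assumed metric for illustration, not necessarily the paper's."""
    support = set(p) | set(q)
    # Mixture distribution over the union of both label sets.
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}

    def kl(a, b):
        # Kullback-Leibler divergence, summing over labels with nonzero mass in a.
        return sum(a[x] * math.log2(a[x] / b[x]) for x in a if a[x] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy, invented labels for a single image region (not data from the paper).
human_labels = ["dog", "dog", "puppy", "animal", "dog", "puppy"]
model_labels = ["dog", "dog", "dog", "dog", "puppy", "dog"]

human = label_distribution(human_labels)
model = label_distribution(model_labels)
jsd = jensen_shannon_divergence(human, model)
print(f"Jensen-Shannon divergence (bits): {jsd:.3f}")
```

A value near 0 would mean the model spreads its labels much as the speakers do; a value near 1 would mean the two sets of labels barely overlap. Under this reading, a model that always outputs the single most frequent human label would still score above 0, because it misses the variation among speakers.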