Computer Vision Datasets and Models Exhibit Cultural and Linguistic Diversity in Perception

Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna

DOI: https://doi.org/10.48550/arXiv.2310.14356

2024-03-10

Abstract: Computer vision often treats human perception as homogeneous: an implicit assumption that visual stimuli are perceived similarly by everyone. This assumption is reflected in the way researchers collect datasets and train vision models. By contrast, literature in cross-cultural psychology and linguistics has provided evidence that people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli. In this paper, we study how these differences manifest themselves in vision-language datasets and models, using language as a proxy for culture. By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression. When datasets are multilingual as opposed to monolingual, descriptions have higher semantic coverage on average, where coverage is measured using scene graphs, model embeddings, and linguistic taxonomies. For example, multilingual descriptions have on average 29.9% more objects, 24.5% more relations, and 46.0% more attributes than a set of monolingual captions. When prompted to describe images in different languages, popular models (e.g. LLaVA) inherit this bias and describe different parts of the image. Moreover, finetuning models on captions from one language performs best on corresponding test data from that language, while finetuning on multilingual data performs consistently well across all test data compositions. Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.

Computer Vision and Pattern Recognition, Computation and Language, Computers and Society, Human-Computer Interaction

What problem does this paper attempt to address?

The paper challenges an assumption that is common in computer vision: that human perception is homogeneous. The authors point out that current research often assumes all people perceive visual stimuli similarly, and that this assumption shapes how datasets are collected and how models are trained. Studies in cross-cultural psychology and linguistics, however, indicate that people from different cultural backgrounds may observe vastly different concepts even when viewing the same visual stimuli. The paper therefore examines how this diversity manifests in vision-language datasets and models, using language as a proxy for culture.

Specifically, by comparing textual descriptions of the same images across 7 languages, the authors find significant differences in semantic content and linguistic expression. Compared with a set of monolingual captions, multilingual descriptions contain on average 29.9% more objects, 24.5% more relations, and 46.0% more attributes. When popular models (such as LLaVA) are prompted to describe images in different languages, they describe different parts of the image. Models fine-tuned on captions from one language perform best on test data from that language, while models fine-tuned on multilingual data perform consistently well across all test data compositions. In summary, the paper documents linguistic and cultural diversity in the image descriptions found in computer vision datasets and suggests that researchers account for it, including by incorporating annotator background information in datasets and prioritizing the use of multilingual vision models.
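To make the "more objects" comparison concrete, here is a minimal sketch of how a relative coverage gain could be computed. It is not the authors' pipeline: the paper measures coverage with scene graphs, model embeddings, and linguistic taxonomies, whereas this sketch assumes each caption has already been reduced to a set of object names. The function names (`union_size`, `relative_gain`) and the toy captions are hypothetical and chosen only for illustration.

```python
def union_size(caption_objects):
    """Count the distinct objects mentioned across a list of captions.

    Each caption is represented as a set of object names (an assumption;
    the paper extracts objects from scene graphs).
    """
    return len(set().union(*caption_objects))


def relative_gain(mono_captions, multi_captions):
    """Relative increase in distinct objects of a multilingual caption set
    over a monolingual caption set for the same image."""
    mono = union_size(mono_captions)
    multi = union_size(multi_captions)
    if mono == 0:
        raise ValueError("monolingual caption set mentions no objects")
    return (multi - mono) / mono


# Toy data (hypothetical): three captions of one image from a single
# language versus three captions of the same image from three languages.
mono_en = [{"dog", "ball"}, {"dog", "grass"}, {"dog", "ball"}]
multi = [{"dog", "ball"}, {"dog", "fence", "grass"}, {"boy", "dog"}]

gain = relative_gain(mono_en, multi)
print(f"relative gain in distinct objects: {gain:.1%}")
```

Averaging such a per-image gain over many images would give a figure of the same kind as the percentages reported above; the same calculation could be repeated for relations and attributes.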
