Abstract: We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated 'nonsense' objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.

What problem does this paper attempt to address?

The paper asks how well computer vision models represent the 3D structure of objects compared with human observers. Specifically, the authors assess whether vision models can infer an object's 3D structure from images taken at multiple viewpoints, as humans do, using a benchmark called Multiview Object Consistency in Humans and in Image models (MOCHI).

### Main Issues

1. **Can computer vision models effectively represent the 3D structure of objects?**
   - Many tasks that seem to require an explicit 3D representation can be solved directly from 2D visual features. For example, depth can be estimated from cues such as texture gradient, relative size, shadows, and camera blur. Showing that a model truly represents 3D structure is therefore difficult.
2. **How do computer vision models perform in the multiview object consistency task?**
   - The authors adapt an experimental design from the cognitive sciences in which participants see three images taken from different viewpoints and identify which one shows a different object. The task evaluates the ability to infer 3D shape in a zero-shot setting.
3. **What are the performance differences between humans and computer vision models in the 3D shape inference task?**
   - The authors compare the accuracy of humans and common vision models (such as DINOv2, MAE, and CLIP) on the task, and relate model performance to human reaction time and gaze data, to reveal similarities and differences between the two.

### Research Methods

- **Experimental Design**: In the multiview object consistency task, participants identify the different object among three images taken from different viewpoints.
- **Data Collection**: Behavioral data from 35,000 trials were collected from over 500 participants, including choice behavior, reaction time, and gaze data.
- **Model Evaluation**: Common vision models were evaluated on the same task, and several evaluation methods (such as distance metrics and linear probes) were used to compare human and model performance.

### Main Findings

- **Humans outperform models**: Humans outperformed all vision models by a wide margin, especially on more challenging trials.
- **Relationship between model performance and scale**: For some model types (such as DINOv2), increasing model scale improved performance; for others (such as MAE), it did not significantly improve performance.
- **Correlation between human and model performance**: Although humans outperformed models, human and model performance was correlated, particularly with respect to which trials were difficult. Humans also allocated more time to challenging trials.

### Conclusion

The study systematically evaluates vision models on a 3D shape inference task, the multiview object consistency task, and compares them with human observers. The results indicate that although human and model performance is correlated, a substantial gap remains: humans outperform all models by a wide margin. These findings help clarify the limitations of current vision models and suggest directions for future research.
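The distance-metric evaluation mentioned under Research Methods can be illustrated with a minimal sketch. It assumes each image in a trial has already been embedded by a vision model (for example DINOv2) into a fixed-length vector, and that the model's "choice" is the image least similar to the others under cosine similarity. The function names `odd_one_out` and `oddity_accuracy` are hypothetical, and this is not necessarily the authors' exact procedure.

```python
import numpy as np


def odd_one_out(embeddings: np.ndarray) -> int:
    """Pick the index of the image least similar to the rest of the trial.

    embeddings: array of shape (n_images, dim), one row per image.
    Assumption: embeddings were precomputed by some vision model.
    """
    # L2-normalise rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T
    # Ignore self-similarity when summing each row.
    np.fill_diagonal(sim, 0.0)
    # The odd item has the lowest total similarity to the other images.
    return int(np.argmin(sim.sum(axis=1)))


def oddity_accuracy(trials, oddity_indices) -> float:
    """Fraction of trials where the chosen image matches the true odd item."""
    choices = np.array([odd_one_out(t) for t in trials])
    return float(np.mean(choices == np.array(oddity_indices)))


# Toy trial: the first two rows stand in for two views of one object,
# the third row for a different object.
toy_trial = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_choice = odd_one_out(toy_trial)
toy_accuracy = oddity_accuracy([toy_trial], [2])
```

With three images per trial, choosing the image with the lowest summed similarity is equivalent to finding the most similar pair and selecting the remaining image. Per-trial choices computed this way could then be compared with human accuracy on the same trials.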

Evaluating Multiview Object Consistency in Humans and Image Models
