Evaluating Multiview Object Consistency in Humans and Image Models

Tyler Bonnen, Stephanie Fu, Yutong Bai, Thomas O'Connell, Yoni Friedman, Nancy Kanwisher, Joshua B. Tenenbaum, Alexei A. Efros
2024-09-10
Abstract: We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated 'nonsense' objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of evaluating how well computer vision models represent the 3D structure of objects, particularly in comparison to human observers. Specifically, the authors assess whether models can infer the 3D structure of objects from images taken at multiple viewpoints, as humans do, using a task called Multiview Object Consistency in Humans and Image Models (MOCHI).

### Main Issues

1. **Can computer vision models effectively represent the 3D structure of objects?**
   - Many tasks that seem to require an explicit 3D representation can in fact be solved directly from 2D visual features. For example, depth can be estimated from cues such as texture gradient, relative size, shadows, and camera blur. Determining whether a model truly represents 3D structure is therefore not straightforward.
2. **How do computer vision models perform in the multiview object consistency task?**
   - The authors adapt a classic experimental design from cognitive science: given three images taken from different viewpoints, participants identify which image shows a different object. The task evaluates the ability to infer 3D shape in a zero-shot setting.
3. **What are the performance differences between humans and computer vision models in the 3D shape inference task?**
   - The authors compare common computer vision models (such as DINOv2, MAE, and CLIP) with humans on the same task, relating model accuracy to human accuracy, reaction time, and gaze data to reveal similarities and differences between the two.

### Research Methods

- **Experimental Design**: In the multiview object consistency task, participants identify the different object among three images taken from different viewpoints.
- **Data Collection**: Behavioral data from 35,000 trials were collected from over 500 participants, including choice behavior, reaction time, and gaze data.
- **Model Evaluation**: Common computer vision models were evaluated on the same task, using multiple evaluation methods (such as distance metrics and linear probes) to compare human and model performance.

### Main Findings

- **Humans outperform models**: Humans significantly outperformed computer vision models under all conditions, especially on more challenging trials.
- **Relationship between model performance and scale**: For certain model types (like DINOv2), increasing model scale improved performance, but for others (like MAE), increasing scale did not significantly improve performance.
- **Correlation between human and model performance**: Although humans outperformed models, human and model performance were correlated: trials that were harder for humans tended to be harder for models as well.

### Conclusion

The study systematically evaluated computer vision models on a 3D shape inference task, the multiview object consistency task, and compared them with human observers. The results indicate that while computer vision models approach human-level performance under certain conditions, a significant overall gap remains. These findings clarify the limitations of current computer vision models and suggest directions for future research.
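
The distance-metric evaluation mentioned under Research Methods can be illustrated with a minimal sketch. It assumes each trial is a set of three image embeddings (for example, features from a pretrained vision model) and that the model's choice is the image least similar to the other two under cosine similarity. The function names `odd_one_out` and `accuracy` and the toy vectors are hypothetical, introduced only for illustration; the authors' actual procedure may differ.

```python
import numpy as np


def odd_one_out(embeddings: np.ndarray) -> int:
    """Return the index of the image least similar to the others.

    `embeddings` has shape (n_images, n_features). This is an
    illustrative decision rule, not the paper's confirmed method.
    """
    # L2-normalize each row so dot products become cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # Ignore each image's similarity to itself.
    np.fill_diagonal(similarity, 0.0)
    # The odd one out has the lowest total similarity to the rest.
    return int(np.argmin(similarity.sum(axis=1)))


def accuracy(trials: list, answers: list) -> float:
    """Fraction of trials where the predicted odd image matches the answer."""
    predictions = np.array([odd_one_out(trial) for trial in trials])
    return float(np.mean(predictions == np.array(answers)))


# Toy stand-ins for model features: two views of one object plus a
# different object. Real embeddings would come from a vision model.
trial_a = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
trial_b = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])

print(odd_one_out(trial_a))
print(accuracy([trial_a, trial_b], [2, 0]))
```

Per-trial or per-condition accuracies computed this way could then be compared against human accuracy on the same image sets, which is the kind of human-model comparison the summary describes.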