Towards Foundation Models for 3D Vision: How Close Are We?

Yiming Zuo, Karhan Kayan, Maggie Wang, Kevin Jeon, Jia Deng, Thomas L. Griffiths
2024-10-15
Abstract: Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark that covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms compared to classical computer vision methods, and Transformer-based networks such as ViT align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the challenge of building foundation models for 3D vision. Specifically, the authors aim to understand the reasoning capabilities of current models on 3D vision tasks and to identify the gaps between these models and the human visual system. To that end, they develop a new benchmark for 3D visual understanding that covers fundamental 3D vision tasks such as depth estimation, spatial visual question answering (VQA), camera pose estimation, and keypoint matching. By evaluating state-of-the-art vision-language models (VLMs), specialized models, and humans on this benchmark, the authors hope to provide insights for the future development of foundation models for 3D vision.

### Main Research Questions:

1. **Do 2D VLMs have the capability to solve 3D tasks?**
   - Although 2D VLMs perform well on existing 2D VQA benchmarks, they perform poorly on 3D tasks. Existing VLMs fail to reach human-level performance and, on some tasks, perform only slightly better than random guessing, especially on geometrically perturbed images.
2. **Are specialized models accurate and robust?**
   - Specialized models are generally accurate but lack robustness to geometric perturbations. For example, in the depth estimation task, the accuracy of the MiDaS model drops significantly on inverted images, whereas human accuracy remains unchanged.
3. **Are humans still the most accurate and robust 3D vision system? How do the error patterns of different models compare to humans?**
   - The results indicate that humans are still the most accurate and robust 3D vision system. Error patterns vary significantly with model type and architecture: Transformer-based models (such as ViT) have error patterns more similar to those of humans, while CNN-based models are less similar.
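The robustness comparison described in the second question comes down to measuring accuracy on the original images and on the perturbed images, then reporting the drop. The following is a minimal sketch, assuming binary-choice answers as in a VQA-style task. The function names and the toy predictions are hypothetical and are not the paper's code or data.

```python
import numpy as np


def accuracy(predictions, labels):
    """Fraction of questions answered correctly."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    if predictions.shape != labels.shape:
        raise ValueError("predictions and labels must have the same shape")
    return float(np.mean(predictions == labels))


def robustness_gap(preds_original, preds_perturbed, labels):
    """Return (accuracy on original, accuracy on perturbed, accuracy drop)."""
    acc_original = accuracy(preds_original, labels)
    acc_perturbed = accuracy(preds_perturbed, labels)
    return acc_original, acc_perturbed, acc_original - acc_perturbed


# Toy data (made up for illustration): 8 binary-choice questions.
labels = [0, 1, 1, 0, 1, 0, 0, 1]
model_original = [0, 1, 1, 0, 1, 0, 1, 1]  # answers on unmodified images
model_flipped = [1, 1, 0, 0, 0, 0, 1, 1]   # answers on perturbed images

acc_o, acc_p, gap = robustness_gap(model_original, model_flipped, labels)
print(f"original={acc_o:.3f} perturbed={acc_p:.3f} drop={gap:.3f}")
```

A model that is accurate but not robust shows a high first number and a large drop; a robust system, as the summary reports for humans, shows a drop near zero.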
### Main Contributions:

- **Proposed a new benchmark**: The benchmark has a unified output space for evaluating the 3D understanding capabilities of existing models.
- **Comprehensive evaluation**: Evaluated state-of-the-art VLMs, specialized models, and humans on the new benchmark, and compared the error patterns of different models with those of humans under multiple criteria.

Through this research, the authors hope to provide insights for improving the robustness and generalization capabilities of future foundation models for 3D vision.
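The summary does not name the criteria used to compare error patterns. One commonly used measure for this kind of comparison is error consistency, which is Cohen's kappa computed over per-question correctness. Treat the sketch below as an assumption about how such a comparison could be done, not as the authors' method; the function name and the toy vectors are hypothetical.

```python
import numpy as np


def error_consistency(correct_a, correct_b):
    """Cohen's kappa between two per-question correctness vectors.

    Each input holds one boolean per question: True if that observer
    (a model or a human) answered the question correctly.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("both observers must answer the same questions")

    acc_a, acc_b = a.mean(), b.mean()
    # Observed agreement: both right or both wrong on the same question.
    observed = np.mean(a == b)
    # Agreement expected by chance from the two accuracies alone.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if np.isclose(expected, 1.0):
        return float("nan")  # undefined when both are always right or always wrong
    return float((observed - expected) / (1 - expected))


# Toy correctness vectors (made up for illustration).
human = [1, 1, 1, 0, 1, 0, 1, 1]
model_same_errors = [1, 1, 1, 0, 1, 0, 1, 1]

print(error_consistency(human, model_same_errors))
```

A value near 1 means two observers fail on the same questions, a value near 0 means their errors overlap no more than their accuracies alone would predict, and negative values mean they tend to fail on different questions.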