OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

Youjun Zhao, Jiaying Lin, Shuquan Ye, Qianshi Pang, Rynson W.H. Lau
2024-08-21
Abstract: Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed object classes. However, existing approaches and benchmarks primarily focus on the open vocabulary problem within the context of object classes, which is insufficient to provide a holistic evaluation to what extent a model understands the 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained and object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which consists of 3D object attributes across eight representative linguistic aspects, including affordance, property, material, and more. We further evaluate state-of-the-art OV-3D methods on our OpenScan benchmark, and discover that these methods struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed by simply scaling up object classes during training. We highlight the limitations of existing methodologies and explore a promising direction to overcome the identified shortcomings. Data and code are available at https://github.com/YoujunZhao/OpenScan
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the inadequacy of existing open-vocabulary 3D scene understanding (OV-3D) methods in recognizing object attributes. Current OV-3D methods and benchmarks focus primarily on object categories, so they cannot comprehensively evaluate how well a model understands a 3D scene. The authors therefore propose a more challenging task, Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D), which explores the open-vocabulary problem beyond object categories. GOV-3D covers an open and diverse set of generalized knowledge, expressed as linguistic queries about fine-grained, object-specific attributes.

To support this task, the authors construct a new benchmark, OpenScan, which covers 3D object attributes across eight representative linguistic aspects, including affordance, property, and material. Evaluating state-of-the-art OV-3D methods on OpenScan, they find that these methods struggle to comprehend the abstract vocabulary of GOV-3D, and that this cannot be resolved simply by scaling up the number of object classes during training. The authors highlight the limitations of existing methods and explore a promising direction for overcoming them.

In summary, the main contributions of this paper are:
1. Introducing the GOV-3D task, which extends the classic OV-3D task toward a broader understanding of 3D scenes.
2. Providing a new benchmark, OpenScan, for comprehensively evaluating the generalization ability of OV-3D segmentation models on abstract object attributes.
3. Demonstrating through extensive experiments that existing OV-3D segmentation models fall short in understanding abstract object attributes, which underscores the importance of comprehensive and reliable benchmarks.