Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, Zhou Zhao
2024-09-28
Abstract: Recent advancements in 3D Large Language Models (LLMs) have demonstrated promising capabilities for 3D scene understanding. However, previous methods exhibit deficiencies in general referencing and grounding capabilities for intricate scene comprehension. In this paper, we introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Specifically, we decompose the input 3D scene into a set of object proposals, each assigned a unique identifier token, which enables efficient object referencing and grounding during user-assistant interactions. Given the scarcity of scene-language data, we model the scene embeddings as a sequence of explicit object-level embeddings, derived from semantic-rich 2D or 3D representations. By employing object identifiers, we transform diverse 3D scene-language tasks into a unified question-answering format, facilitating joint training without the need for additional task-specific heads. With minimal fine-tuning on all downstream tasks, our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the weakness of existing 3D large language models (LLMs) at object referencing and localization (grounding) in 3D scene understanding. It identifies three problems with current methods when they deal with complex scenes:

1. **Insufficient object referencing capability**: Existing 3D LLMs perform poorly at understanding and referencing user-specified objects, especially in tasks that require precise object referencing.
2. **Limited object localization capability**: Although some methods attempt to improve object localization by adding position tokens, they do not perform well on 3D benchmarks, mainly because 3D scene-language data is scarce.
3. **Dependence on task-specific heads**: Many existing methods rely on task-specific head structures, which limits the model's generality and adaptability.

To overcome these issues, the paper introduces object identifiers and object-centric representations for understanding and interacting with 3D scenes. Its main contributions are:

- **Object-level 3D MLLM**: The model represents and interacts with 3D scenes through object identifiers, which makes object referencing and localization more efficient.
- **Unified task format**: Different 3D scene-language tasks are converted into a single question-answering format, enabling joint training without additional task-specific heads.
- **Multimodal object-centric representation**: Pre-trained 2D and 3D models extract object features, which simple linear layers project into the language model's embedding space, mitigating the impact of scene-language data scarcity.

With these changes, the proposed method outperforms existing methods on 3D scene understanding benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, which suggests potential for complex real-world applications.
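To make the object-identifier idea more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes an identifier spelling of the form `<OBJ000>`, toy two-dimensional features in place of real pre-trained 2D/3D object features, and hypothetical prompt templates; the names `ObjectProposal`, `identifier`, `project`, `build_scene_sequence`, `format_qa`, and `parse_identifiers` are invented for illustration. It shows three things described above: assigning one identifier per object proposal, projecting each object feature with a linear layer, and phrasing different tasks as question answering whose answers can refer to objects by identifier.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectProposal:
    label: str            # hypothetical detector label, kept only for readability
    feature: List[float]  # stand-in for a pre-trained 2D/3D object feature


def identifier(index: int) -> str:
    """Return the identifier token for the proposal at `index` (assumed format)."""
    return f"<OBJ{index:03d}>"


def project(feature: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Apply a linear layer y = W x + b to map a feature into the embedding space."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]


def build_scene_sequence(
    proposals: List[ObjectProposal], weight: List[List[float]], bias: List[float]
) -> List[Tuple[str, List[float]]]:
    """Pair every identifier token with the projected embedding of its object."""
    return [
        (identifier(i), project(p.feature, weight, bias))
        for i, p in enumerate(proposals)
    ]


def format_qa(task: str, text: str) -> str:
    """Phrase a task as a question; these templates are illustrative assumptions."""
    templates = {
        "grounding": 'Which object matches "{t}"? Answer with its identifier.',
        "captioning": "Describe the object {t}.",
        "qa": "{t}",
    }
    return templates[task].format(t=text)


def parse_identifiers(answer: str) -> List[int]:
    """Recover proposal indices from identifier tokens in a generated answer."""
    return [int(m) for m in re.findall(r"<OBJ(\d{3})>", answer)]


# Toy scene: two proposals with 2-D features projected to a 3-D embedding space.
proposals = [ObjectProposal("chair", [1.0, 0.0]), ObjectProposal("table", [0.0, 2.0])]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]

scene = build_scene_sequence(proposals, weight, bias)
question = format_qa("captioning", identifier(1))
grounded = parse_identifiers("The matching object is <OBJ001>.")
```

Because a grounding answer is just text containing an identifier, the same language-model output path can serve grounding, captioning, and question answering; mapping the parsed index back to the proposal's geometry is what replaces a task-specific head in this sketch.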