Abstract: Recent advancements in 3D Large Language Models (LLMs) have demonstrated promising capabilities for 3D scene understanding. However, previous methods exhibit deficiencies in general referencing and grounding capabilities for intricate scene comprehension. In this paper, we introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Specifically, we decompose the input 3D scene into a set of object proposals, each assigned a unique identifier token, which enables efficient object referencing and grounding during user-assistant interactions. Given the scarcity of scene-language data, we model the scene embeddings as a sequence of explicit object-level embeddings, derived from semantic-rich 2D or 3D representations. By employing object identifiers, we transform diverse 3D scene-language tasks into a unified question-answering format, facilitating joint training without the need for additional task-specific heads. With minimal fine-tuning on all downstream tasks, our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.

What problem does this paper attempt to address?

The paper addresses the weak object referencing and localization (grounding) abilities of existing 3D large language models (LLMs). It identifies three problems in how current methods handle complex scenes:

1. **Insufficient object referencing capability**: Existing 3D LLMs struggle to understand and refer to user-specified objects, especially in tasks that require precise object references.
2. **Limited object localization capability**: Some methods try to improve localization by adding position tokens, but they perform poorly on 3D benchmarks, mainly because 3D scene-language data is scarce.
3. **Dependence on task-specific heads**: Many existing methods rely on task-specific heads, which limits the model's generality and adaptability.

To overcome these problems, the paper introduces object identifiers and object-centric representations so that the model interacts with a scene at the object level. Its main contributions are:

- **Object-level 3D MLLM**: The scene is decomposed into object proposals, each assigned a unique identifier token, which makes object referencing and localization more efficient.
- **Unified task format**: Different 3D scene-language tasks are converted into a single question-answering format, enabling joint training without additional task-specific heads.
- **Multimodal object-centric representation**: Object features are extracted with pre-trained 2D and 3D models and projected into the language model's embedding space through simple linear layers, which mitigates the scarcity of scene-language data.

According to the abstract, the resulting model significantly outperforms existing methods on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
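The ideas above can be illustrated with a minimal sketch. It is not the paper's implementation: the `<OBJxxx>` token format, the prompt wording, and the helper names (`object_id_token`, `grounding_to_qa`, `captioning_to_qa`, `parse_grounded_ids`, `project`) are all assumptions chosen for illustration. The sketch shows how a grounding task and a captioning task could both be phrased as question-answer pairs once every object proposal has an identifier, and how a plain linear layer could map an object feature into an embedding space.

```python
import re
from dataclasses import dataclass
from typing import List


def object_id_token(index: int) -> str:
    """Return a hypothetical identifier token for object proposal `index`.

    The exact token strings used by the paper may differ; this format is an assumption.
    """
    return f"<OBJ{index:03d}>"


@dataclass
class QAExample:
    question: str
    answer: str


def grounding_to_qa(description: str, target_index: int) -> QAExample:
    """Grounding as QA: the answer is the identifier of the referred object."""
    return QAExample(
        question=f"Which object matches this description: {description}",
        answer=object_id_token(target_index),
    )


def captioning_to_qa(target_index: int, caption: str) -> QAExample:
    """Captioning as QA: the question refers to the object by its identifier."""
    return QAExample(
        question=f"Describe {object_id_token(target_index)} in the scene.",
        answer=caption,
    )


def parse_grounded_ids(answer: str) -> List[int]:
    """Recover object indices from identifier tokens in generated text."""
    return [int(m) for m in re.findall(r"<OBJ(\d+)>", answer)]


def project(feature: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Apply a linear layer y = W x + b to one object feature (pure-Python illustration)."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]
```

Because both tasks produce the same `QAExample` shape, they can be mixed in one training set, and the grounded objects in a generated answer are recovered by parsing identifier tokens rather than by a separate prediction head.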

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
