Unified Scene Representation and Reconstruction for 3D Large Language Models

Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qiong Liu, Jiaqi Wang

2024-04-20

Abstract: Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or from 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted to point clouds, which serve as inputs for LLMs. However, this solution does not establish 3D point-to-point connections, so spatial structure information is lost. In addition, the geometric and semantic representations of the scene are not integrated or unified, which limits 3D scene understanding. In this paper, we demonstrate the importance of a unified scene representation and reconstruction framework for LLMs in 3D scenes. Specifically, we introduce Uni3DR^2, which extracts 3D geometric and semantic-aware representation features via frozen pre-trained 2D foundation models (e.g., CLIP and SAM) and a multi-scale aggregate 3D decoder. Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs. Experimental results validate that Uni3DR^2 yields convincing gains over the baseline on the 3D reconstruction dataset ScanNet (increasing F-Score by +1.8%). When applied to LLMs, Uni3DR^2-LLM outperforms the baseline on the 3D vision-language understanding dataset ScanQA (increasing BLEU-1 by +4.0% and +4.2% on the val set and test set, respectively). Furthermore, it outperforms the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA.
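To make the "lifting" step in the abstract concrete, the following is a minimal sketch of how per-pixel 2D features can be attached to 3D points by back-projecting a depth map through a pinhole camera. It illustrates the existing pipeline the abstract criticizes, not Uni3DR^2 itself. The function name `lift_features_to_points`, the intrinsics `fx, fy, cx, cy`, and the toy inputs are all assumptions made for illustration; real pipelines would use CLIP feature maps and calibrated cameras.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def lift_features_to_points(
    depth: Sequence[Sequence[float]],
    features: Sequence[Sequence[Sequence[float]]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[List[Point], List[List[float]]]:
    """Back-project pixels with valid depth into camera space (hypothetical helper).

    depth[v][u] is the depth at pixel column u, row v; features[v][u] is the
    2D feature vector at that pixel. Pixels with non-positive depth are skipped.
    """
    points: List[Point] = []
    lifted: List[List[float]] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no valid depth, nothing to lift
            # Pinhole back-projection: pixel (u, v) at depth z -> (x, y, z)
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
            # The 2D feature is copied onto the point unchanged
            lifted.append(list(features[v][u]))
    return points, lifted


# Toy 2x2 example with made-up depth values and 2-dimensional features
depth = [[1.0, 0.0], [2.0, 2.0]]
features = [[[1, 0], [0, 1]], [[1, 1], [0, 0]]]
points, lifted = lift_features_to_points(depth, features, 1.0, 1.0, 0.0, 0.0)
```

Each point receives its feature independently of its neighbors, which is one way to see why the abstract says this kind of lifting establishes no 3D point-to-point connections.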

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the difficulty of letting large language models (LLMs) interact with 3D environments. Existing methods obtain point clouds either from ground truth (GT) geometry or from 3D scenes reconstructed by auxiliary models, and then lift 2D features onto those point clouds as inputs for LLMs. This approach does not establish 3D point-to-point connections, so spatial structure information is insufficient. In addition, the geometric and semantic representations of the scene are not integrated or unified, which limits 3D scene understanding.

To address these issues, the paper introduces a unified scene representation and reconstruction framework, Uni3DR^2. It extracts 3D geometric and semantic-aware features using frozen pre-trained 2D foundation models (such as CLIP and SAM) together with a multi-scale aggregate 3D decoder. The learned 3D representation both supports the reconstruction process and provides useful knowledge for LLMs. According to the reported experiments, Uni3DR^2 improves F-Score by +1.8% over the baseline on the 3D reconstruction dataset ScanNet, and Uni3DR^2-LLM improves BLEU-1 over the baseline by +4.0% on the validation set and +4.2% on the test set of the 3D vision-language understanding dataset ScanQA. It also outperforms the state-of-the-art method that uses additional GT point clouds on both ScanQA and 3DMV-VQA. The main contribution is Uni3DR^2 itself: a concise, unified module that produces rich 3D semantic and object detail without requiring expensive GT point clouds and without the performance ceiling that comes from depending on predicted point clouds, which in turn improves the ability of LLMs to perform vision-language tasks in 3D environments.
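For readers unfamiliar with the reconstruction metric mentioned above, the following is a minimal sketch of how an F-Score between a predicted and a reference point cloud is commonly computed: precision is the fraction of predicted points within a distance threshold of the reference, recall is the reverse, and the F-Score is their harmonic mean. The function name `f_score`, the brute-force nearest-neighbor search, and the 0.05 threshold are assumptions for illustration; the source does not state the exact evaluation protocol used.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def f_score(pred: Sequence[Point], gt: Sequence[Point], threshold: float) -> float:
    """Harmonic mean of precision and recall under a distance threshold.

    Hypothetical helper: uses O(N*M) brute-force search, fine for toy inputs only.
    """
    if not pred or not gt:
        return 0.0

    def nearest(p: Point, cloud: Sequence[Point]) -> float:
        # Distance from p to its closest point in the other cloud
        return min(math.dist(p, q) for q in cloud)

    # Precision: share of predicted points close to some reference point
    precision = sum(nearest(p, gt) < threshold for p in pred) / len(pred)
    # Recall: share of reference points close to some predicted point
    recall = sum(nearest(q, pred) < threshold for q in gt) / len(gt)

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy clouds with made-up coordinates; threshold value is an assumption
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.01), (5.0, 0.0, 0.0)]
score = f_score(pred, gt, threshold=0.05)
```

In this toy case one of two predicted points and one of two reference points are matched, so precision and recall are both 0.5. A practical evaluation would replace the brute-force search with a spatial index such as a k-d tree.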

Uni3DL: Unified Model for 3D and Language Understanding

Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models

Uni3D: Exploring Unified 3D Representation at Scale

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

3D-LLM: Injecting the 3D World into Large Language Models

UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision

Grounded 3D-LLM with Referent Tokens

Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

A Unified Framework for 3D Scene Understanding

UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction for Autonomous Driving

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

Towards CLIP-driven Language-free 3D Visual Grounding Via 2D-3D Relational Enhancement and Consistency

ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues

ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences