Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong
2024-03-23
Abstract: This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper introduces **Scene-LLM**, a 3D Vision-Language Model (3D-VLM) that builds on the reasoning capabilities of large language models (LLMs) to address a range of 3D visual understanding and reasoning tasks in indoor environments. Specifically, the paper addresses the following issues:

1. **Fusion of 3D Visual Information and Language Models**:
   - Most current Vision-Language Models (VLMs) have made progress in 2D visual language understanding but perform poorly when dealing with persistent 3D spatial information. Scene-LLM compensates for this by integrating scene-level and ego-centric (first-person) 3D information.
2. **Interactive Planning in Dynamic Scenes**:
   - Existing methods typically handle only static 3D scenes and struggle with interactive planning tasks in which the scene changes. Scene-LLM supports interactive planning in dynamic environments by combining scene-level information, used for global planning, with ego-centric information, used for localization.
3. **3D Feature Representation and Update**:
   - To process 3D visual information effectively and align it with pre-trained language models, Scene-LLM proposes a hybrid representation that retains dense spatial information and supports updates of the scene state.
4. **Multimodal Data Generation and Alignment**:
   - The paper also proposes a large-scale data generation method, producing approximately 190,000 ego-centric 3D-text pairs and about 500,000 scene-level data pairs, to train the model to better understand and execute tasks in 3D environments.

With these improvements, Scene-LLM performs strongly on multiple benchmarks, achieving state-of-the-art results particularly in 3D Visual Question Answering (3D-VQA) and interactive planning tasks.
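
The abstract describes a dense spatial feature representation that a projection layer maps into the language model's textual embedding space. The sketch below illustrates that general idea only; it is not the authors' implementation. The voxel mean pooling, the single linear projection, the function names (`voxelize_mean`, `project_to_text_space`), and all dimensions are assumptions chosen for illustration.

```python
import numpy as np


def voxelize_mean(points, feats, voxel_size):
    """Average per-point features inside each occupied voxel (assumed pooling)."""
    # Integer voxel coordinates for every 3D point.
    idx = np.floor(points / voxel_size).astype(np.int64)
    # Unique occupied voxels and, for each point, the voxel it falls into.
    keys, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(keys), feats.shape[1]))
    np.add.at(sums, inverse, feats)  # accumulate features per voxel
    counts = np.bincount(inverse, minlength=len(keys))
    return keys, sums / counts[:, None]


def project_to_text_space(voxel_feats, weight, bias):
    """Linear projection into a (hypothetical) LLM token-embedding space."""
    return voxel_feats @ weight + bias


# Toy data: three points with 2-D visual features, two occupied voxels.
points = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.1, 0.1]])
feats = np.array([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]])
voxel_keys, voxel_feats = voxelize_mean(points, feats, voxel_size=1.0)

# Hypothetical sizes: 2-D visual features projected to a 4-D embedding space.
rng = np.random.default_rng(0)
weight = rng.normal(size=(2, 4))
bias = np.zeros(4)
tokens = project_to_text_space(voxel_feats, weight, bias)  # one token per voxel
```

Under these assumptions, a scene state update amounts to recomputing the features of the voxels touched by a new ego-centric observation and projecting only those rows again, while the remaining scene tokens stay unchanged.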