Abstract: This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper introduces **Scene-LLM**, a 3D Vision-Language Model (3D-VLM) that builds on the capabilities of large language models (LLMs) to address a range of 3D visual understanding and reasoning tasks in indoor environments. Specifically, the paper addresses the following issues:

1. **Fusion of 3D Visual Information and Language Models**:
   - Most current Vision-Language Models (VLMs) have made progress in 2D visual language understanding but perform poorly when dealing with persistent 3D spatial information. Scene-LLM compensates for this by integrating scene-level and first-person perspective 3D information.

2. **Interactive Planning in Dynamic Scenes**:
   - Existing methods typically handle only static 3D scenes and struggle with interactive planning tasks involving scene changes. Scene-LLM supports interactive planning in dynamic environments by merging scene-level and first-person perspective information.

3. **3D Feature Representation and Update**:
   - To effectively process 3D visual information and align it with pre-trained language models, Scene-LLM proposes a hybrid representation method that retains dense spatial information and supports real-time updates of scene states.

4. **Multimodal Data Generation and Alignment**:
   - The paper also proposes a large-scale data generation method, producing approximately 190,000 first-person perspective 3D-text pairs and about 500,000 scene-level pairs, to train the model to better understand and execute tasks in 3D environments.

With these improvements, Scene-LLM performs strongly on multiple benchmarks, achieving state-of-the-art results particularly in 3D Visual Question Answering (3D-VQA) and interactive planning tasks.
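The projection step mentioned in the abstract (mapping 3D visual features into the language model's textual embedding space) can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions `D_VIS` and `D_TEXT`, the helper names, the single linear layer, and the choice to simply concatenate scene-level and ego-centric tokens ahead of the text tokens are all assumptions made for illustration.

```python
import random

def project(features, weight, bias):
    """Apply y = xW + b to every feature vector (a plain linear projection)."""
    columns = list(zip(*weight))  # columns of W, each of length D_VIS
    return [
        [sum(x * w for x, w in zip(vec, col)) + b for col, b in zip(columns, bias)]
        for vec in features
    ]

def rand_matrix(rows, cols):
    """Random stand-in for learned weights or extracted features."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
D_VIS, D_TEXT = 4, 6                 # hypothetical feature / embedding sizes

scene_feats = rand_matrix(5, D_VIS)  # assumed: 5 scene-level feature tokens
ego_feats = rand_matrix(3, D_VIS)    # assumed: 3 ego-centric frame tokens
W = rand_matrix(D_VIS, D_TEXT)       # projection weights (learned in practice)
bias = [0.0] * D_TEXT

# Project both kinds of 3D features into the text embedding space.
visual_tokens = project(scene_feats + ego_feats, W, bias)

# Stand-in for the embedded text prompt; the LLM would consume the joint sequence.
text_tokens = rand_matrix(2, D_TEXT)
llm_input = visual_tokens + text_tokens
```

After projection, every visual token has the same width as a text token, so the two can share one input sequence. How Scene-LLM actually orders or interleaves these tokens is not stated in the text above.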

Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

3D-LLM: Injecting the 3D World into Large Language Models

SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Language-Image Models with 3D Understanding

When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

Grounded 3D-LLM with Referent Tokens

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

An Embodied Generalist Agent in 3D World

LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image

SceneGPT: A Language Model for 3D Scene Understanding

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Leveraging Large Language Models for Robot 3D Scene Understanding

3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

Query3D: LLM-Powered Open-Vocabulary Scene Segmentation with Language Embedded 3D Gaussian

OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning