Abstract: This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++ that benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/

What problem does this paper attempt to address?

The paper addresses key challenges in 3D object understanding and interaction. It proposes ShapeLLM, the first 3D Multimodal Large Language Model designed specifically for embodied interaction and spatial intelligence. ShapeLLM is built on the improved 3D encoder ReCon++, which strengthens geometric understanding by distilling from multi-view images. It is evaluated on a newly constructed benchmark, 3D MM-Vet, which covers tasks such as embodied visual grounding and scene understanding.

Specifically, ShapeLLM tackles the following problems:

1. Accurate capture of 3D geometric information: 3D shape understanding requires enough geometric information to support precise spatial and structural processing.
2. Embodied interaction understanding: The model needs basic knowledge of embodied interaction in order to understand what objects are for, such as how to operate them.
3. Establishment of a universal interface: A universal interface is needed as a bridge between information encoding and decoding, so that high-level instructions can be converted into agent responses such as dialogue responses and embodied feedback.

The proposed method, ShapeLLM, addresses these requirements and includes the following innovations:

- Using 3D point clouds as input, which represent the physical environment more accurately than 2D images.
- Incorporating a multi-view distillation technique that improves multi-level feature understanding through adaptive selective matching based on the Hungarian algorithm.
- Applying instruction-following fine-tuning on constructed language-output data, which enables unified processing of various 3D understanding tasks.
Experimental results show that the improved 3D encoder ReCon++ achieves state-of-the-art performance on downstream fine-tuning and zero-shot 3D object recognition tasks, with a particularly significant improvement on the challenging ScanObjectNN dataset. ShapeLLM also performs strongly on the newly constructed 3D MM-Vet benchmark, surpassing the previous best results. This marks an important step toward using large language models for research on embodied object interaction.
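The Hungarian-algorithm matching mentioned in the summary can be illustrated with a small, self-contained sketch. It is not the ReCon++ implementation. It assumes that each query embedding is paired one-to-one with a view feature by minimising total cosine distance. The names `cosine_distance`, `hungarian`, and `match_queries_to_views`, the toy 2-D vectors, and the choice of cosine distance as the cost are all illustrative assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors (assumed cost)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list whose entry i is the column assigned to row i.
    Uses the O(n^2 * m) potentials formulation of the Hungarian algorithm.
    """
    n, m = len(cost), len(cost[0])
    assert n <= m, "need at least as many columns as rows"
    inf = float("inf")
    u = [0.0] * (n + 1)    # row potentials (1-based)
    v = [0.0] * (m + 1)    # column potentials (1-based, index 0 is a dummy)
    p = [0] * (m + 1)      # p[j]: row matched to column j, 0 means free
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:     # reached a free column: path is complete
                break
        while j0 != 0:         # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_queries_to_views(queries, view_features):
    """Pair each query with one view feature, minimising total cosine distance."""
    cost = [[cosine_distance(q, f) for f in view_features] for q in queries]
    return hungarian(cost)


# Toy data (hypothetical): three queries and three view features in 2-D.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_features = [[0.0, 2.0], [3.0, 3.0], [5.0, 0.0]]
print(match_queries_to_views(queries, view_features))  # [2, 0, 1]
```

In this toy case each query is matched to the view feature pointing in the same direction, regardless of the order in which the views are listed. That order-independence is the point of using an assignment step rather than a fixed query-to-view pairing.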

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

Language-Image Models with 3D Understanding

Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

3D-LLM: Injecting the 3D World into Large Language Models

When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Grounded 3D-LLM with Referent Tokens

Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

3D Spatial Understanding in MLLMs: Disambiguation and Evaluation

LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

Unified Scene Representation and Reconstruction for 3D Large Language Models

Uni3DL: Unified Model for 3D and Language Understanding

InfMLLM: A Unified Framework for Visual-Language Tasks

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World