ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma
2024-07-12
Abstract: This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++ that benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses key challenges in 3D object understanding and interaction. It proposes ShapeLLM, which the authors present as the first 3D Multimodal Large Language Model designed for embodied interaction. ShapeLLM is built on ReCon++, an improved 3D encoder that strengthens geometry understanding by distilling from multi-view images. It is evaluated on a newly constructed, human-curated benchmark, 3D MM-Vet, which covers tasks such as embodied visual grounding and scene understanding.

Specifically, ShapeLLM targets three requirements:
1. Accurate capture of 3D geometric information: precise spatial and structural processing requires a representation that captures sufficient 3D geometric detail.
2. Embodied interaction understanding: the model needs basic knowledge of embodied interaction to understand what objects are for and how to operate them.
3. A universal interface: a bridge between information encoding and decoding is needed to convert high-level instructions into agent responses, such as dialogue responses and embodied feedback.

To meet these requirements, ShapeLLM includes the following innovations:
- 3D point clouds as input, which represent the physical environment more accurately than 2D images.
- Multi-view distillation, which improves the understanding of multi-level features through adaptive, selective matching based on the Hungarian algorithm.
- Instruction-following fine-tuning on constructed language-output data, which enables unified handling of various 3D understanding tasks.

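The matching step in multi-view distillation can be illustrated with a small sketch. The summary only states that the Hungarian algorithm is used for selective matching; everything else here is an assumption made for illustration: the function names, the cosine-distance cost, the toy 2D features, and the idea that each learned query is paired with exactly one view feature. For brevity the sketch finds the minimum-cost one-to-one assignment by exhaustive search, which returns the same optimum the Hungarian algorithm would find in polynomial time, but is only practical for a handful of views.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def match_queries_to_views(queries, view_feats):
    """Find the one-to-one query-to-view assignment with minimum total cost.

    Hypothetical helper: brute-force search over permutations, used here as a
    stand-in for the Hungarian algorithm (same optimum, exponential time).
    Returns (assignment, mean_cost), where assignment[i] is the index of the
    view feature matched to query i.
    """
    n = len(queries)
    assert n == len(view_feats), "sketch assumes equal numbers of queries and views"
    # Pairwise cost matrix: cost[i][j] = distance between query i and view j.
    cost = [[cosine_distance(q, v) for v in view_feats] for q in queries]

    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    # The mean matched cost could serve as a distillation loss (assumption).
    return list(best_perm), best_total / n


# Toy example: three query vectors and three view features (made-up values).
queries = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
views = [[2.0, 0.1], [0.9, 1.1], [0.1, 3.0]]
assignment, loss = match_queries_to_views(queries, views)
print(assignment, round(loss, 4))
```

In this toy input each query ends up paired with the view feature pointing in nearly the same direction, so the order of the views does not need to be fixed in advance. For realistic numbers of views, a polynomial-time solver for the same assignment problem would replace the exhaustive loop.
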
Experimental results show that the improved 3D encoder ReCon++ achieves state-of-the-art performance on downstream fine-tuning and zero-shot 3D object recognition tasks, with a notable improvement on the challenging ScanObjectNN dataset. ShapeLLM also outperforms previous methods on the newly constructed 3D MM-Vet benchmark. The authors present this as an important step toward using large language models for embodied object interaction.