GPT4Point: A Unified Framework for Point-Language Understanding and Generation

Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, Hengshuang Zhao
2023-12-06
Abstract: Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world is notably deficient, limiting progress in 3D language understanding and generation. To solve this problem, we introduce GPT4Point, an innovative, groundbreaking point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point is equipped with advanced capabilities for controllable 3D generation: it can produce high-quality results from a low-quality point-text feature while maintaining the geometric shapes and colors. To support the expansive needs of 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It constructs a large-scale database of over 1M objects at varied text granularity levels from the Objaverse-XL dataset, which is essential for training GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D point-language understanding capabilities. In extensive evaluations, GPT4Point has demonstrated superior performance in understanding and generation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper primarily aims to address the following issues:
1. **Enhancing 3D understanding and generation**: Existing multimodal large language models (MLLMs) perform well at 2D image-text understanding and image generation but show significant deficiencies in understanding and generating 3D content. The paper proposes a new framework, GPT4Point, which addresses this gap with a unified approach to point-language understanding and generation.
2. **Developing multimodal models for 3D objects**: Existing 3D MLLMs usually either focus on understanding entire scenes while neglecting the geometric details of individual objects, or rely on information converted from 2D images, which leads to a loss of geometric accuracy. GPT4Point focuses on individual 3D objects and trains directly on point cloud data to overcome these limitations.
3. **Building a large-scale point-language dataset**: To address the scarcity of 3D point-language data, the paper proposes an automated data annotation engine, Pyramid-XL, and uses it to construct a database of over 1M objects from the Objaverse-XL dataset, with text annotations at different levels of granularity.
4. **Achieving controllable, high-quality 3D object generation**: By combining low-quality point cloud features with textual information, GPT4Point can generate high-quality 3D objects while preserving their geometric shapes and colors, addressing the random and uncontrollable textures produced by existing 3D generation models.
5. **Establishing a comprehensive evaluation benchmark**: To objectively evaluate model performance on 3D point-language tasks, the paper establishes a novel object-level point cloud benchmark, including evaluation metrics for 3D object recognition, text reasoning tasks, and 3D object generation.
In summary, the main goal of the paper is to improve how MLLMs understand and generate 3D objects by introducing the GPT4Point framework, supported by a large-scale dataset and a comprehensive evaluation benchmark.