Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
2023-09-02
Abstract: We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and multi-modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions. By parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA, which requires no 3D instruction data, but exhibits superior 3D and multi-modal question-answering capacity. We hope our work may cast a light on the community for extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, Multimedia
What problem does this paper attempt to address?
The paper aims to build a unified 3D multi-modal framework that aligns point clouds with other modalities (2D images, language, audio, and video). It introduces two models: Point-Bind and Point-LLM.

### Point-Bind

Point-Bind constructs a joint embedding space that aligns 3D point clouds with multiple modalities (2D images, text, audio, and so on). This shared space supports the following applications:

1. **Any-to-3D Generation**: Generate 3D shapes conditioned on any modality (text, image, audio, or point cloud).
2. **3D Embedding-space Arithmetic**: Combine semantics across modalities by adding 3D features to the features of other modalities.
3. **3D Zero-shot Understanding**: Classify and understand 3D shapes from new, unseen categories, including open-world understanding driven by audio.

### Point-LLM

Point-LLM is a 3D large language model (LLM). It injects the semantics of Point-Bind into a pre-trained LLM (such as LLaMA) through parameter-efficient fine-tuning techniques. Its main contributions are:

1. **3D Question Answering**: Generates detailed answers from 3D point clouds and inputs in other modalities, and performs cross-modal reasoning.
2. **Data and Parameter Efficiency**: Requires no 3D instruction data for training; it is fine-tuned only on publicly available vision-language data, which saves considerable resources.
3. **3D and Multi-modal Reasoning**: Reasons over inputs from multiple modalities and generates answers that draw on all of them.

### Summary

The main purpose of the paper is to expand the application scenarios of 3D point clouds, especially in generation, understanding, and instruction following, by constructing a joint embedding space that aligns 3D point clouds with multiple modalities. Together, Point-Bind and Point-LLM support 3D understanding, generation, and instruction following without relying on 3D instruction data.
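
To make the joint embedding space more concrete, the sketch below illustrates two of the Point-Bind applications listed above, zero-shot classification and embedding-space arithmetic, using hand-written toy vectors. It is not the authors' implementation: the function names (`zero_shot_classify`, `combine`), the three-dimensional embeddings, and the use of cosine similarity as the matching score are all assumptions made for illustration. In practice the embeddings would come from trained 3D, text, and audio encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(point_emb, label_embs):
    """Return the label whose (assumed) text embedding is closest to the point-cloud embedding."""
    return max(label_embs, key=lambda name: cosine(point_emb, label_embs[name]))

def combine(emb_a, emb_b):
    """Embedding arithmetic: add two unit-length embeddings and renormalize the sum."""
    a, b = normalize(emb_a), normalize(emb_b)
    return normalize([x + y for x, y in zip(a, b)])

# Toy embeddings (hypothetical values standing in for encoder outputs).
text_embs = {
    "car": [1.0, 0.0, 0.0],
    "piano": [0.0, 1.0, 0.0],
    "sea": [0.0, 0.0, 1.0],
}
point_emb = [0.9, 0.1, 0.0]   # a point cloud that resembles a car
audio_emb = [0.0, 0.1, 0.9]   # an audio clip that resembles sea waves

# Zero-shot classification: pick the nearest label in the shared space.
predicted = zero_shot_classify(point_emb, text_embs)

# Cross-modal composition: the combined query sits between "car" and "sea".
query = combine(point_emb, audio_emb)
scores = {name: cosine(query, emb) for name, emb in text_embs.items()}
```

With these toy values, `predicted` is `"car"`, and the combined `query` scores equally against "car" and "sea" while scoring much lower against "piano". This is the behavior one would hope for when a 3D feature is added to an audio feature in a shared space.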