PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, Peng Gao
2023-08-27
Abstract: Large-scale pre-trained models have shown promising open-world performance for both vision and language tasks. However, their transferred capacity on 3D point clouds is still limited and only constrained to the classification task. In this paper, we first collaborate CLIP and GPT to be a unified 3D open-world learner, named as PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with the pre-trained language knowledge, PointCLIP V2 contains two key designs. For the visual end, we prompt CLIP via a shape projection module to generate more realistic depth maps, narrowing the domain gap between projected point clouds with natural images. For the textual end, we prompt the GPT model to generate 3D-specific text as the input of CLIP's textual encoder. Without any training in 3D domains, our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification. On top of that, V2 can be extended to few-shot 3D classification, zero-shot 3D part segmentation, and 3D object detection in a simple manner, demonstrating our generalization ability for unified 3D open-world learning.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper mainly addresses the following issues:

1. **Unified 3D Open-World Learning Framework**: The paper proposes PointCLIP V2, which combines CLIP (Contrastive Language-Image Pre-training) and GPT-3 (Generative Pre-trained Transformer 3) for zero-shot classification, segmentation, and detection on 3D point clouds. The model handles these open-world 3D tasks without any training on 3D data.
2. **Mitigating the 2D-3D Domain Gap**: To make better use of the knowledge CLIP learned from images, the paper introduces two key designs. On the visual side, a shape projection module generates more realistic depth maps as the prompt for CLIP, narrowing the domain gap between projected point clouds and natural images. On the textual side, GPT-3 is prompted to generate text rich in 3D semantics, which serves as the input to CLIP's text encoder.
3. **Improving Zero-Shot Classification Performance**: Compared with its predecessor PointCLIP, PointCLIP V2 raises zero-shot 3D classification accuracy by +42.90%, +40.44%, and +28.75% on the ModelNet10, ModelNet40, and ScanObjectNN datasets, respectively.
4. **Extending to Other 3D Open-World Tasks**: Besides zero-shot classification, PointCLIP V2 can be extended to few-shot classification, zero-shot part segmentation, and zero-shot 3D object detection with simple modifications.
5. **Overcoming Limitations of Existing Methods**: PointCLIP was limited to classification. PointCLIP V2 improves classification accuracy and, for the first time, handles zero-shot part segmentation and zero-shot 3D object detection within the same framework.
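The zero-shot pipeline described above (project the point cloud to depth maps, encode the views and the class texts, then match them by similarity) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: `project_depth_map` is a naive orthographic projection and not the paper's shape projection module, the feature arrays are toy placeholders for the outputs of CLIP's visual and text encoders, and averaging the per-view similarities is just one simple way to aggregate views.

```python
import numpy as np


def project_depth_map(points, size=32):
    """Naive orthographic projection of an (N, 3) point cloud onto the xy-plane.

    Hypothetical stand-in for a projection step, NOT the paper's module.
    Occupied pixels hold a nearness value in [0.5, 1.0]; empty pixels stay 0.
    """
    pts = np.asarray(points, dtype=np.float64)
    mins = pts.min(axis=0)
    span = np.maximum(pts.max(axis=0) - mins, 1e-8)  # avoid division by zero
    unit = (pts - mins) / span                       # every axis scaled to [0, 1]
    cols = np.minimum((unit[:, 0] * size).astype(int), size - 1)
    rows = np.minimum((unit[:, 1] * size).astype(int), size - 1)
    depth = np.zeros((size, size))
    # Points with smaller z are treated as nearer; keep the nearest per pixel.
    np.maximum.at(depth, (rows, cols), 1.0 - 0.5 * unit[:, 2])
    return depth


def l2_normalize(x):
    """Scale each row vector to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(view_features, text_features, view_weights=None):
    """Match (V, D) view features against (K, D) class text features.

    Returns the predicted class index and the (K,) aggregated scores.
    """
    v = l2_normalize(np.asarray(view_features, dtype=np.float64))
    t = l2_normalize(np.asarray(text_features, dtype=np.float64))
    logits = v @ t.T                                 # (V, K) cosine similarities
    if view_weights is None:
        view_weights = np.full(len(v), 1.0 / len(v))  # assumption: plain average
    scores = np.asarray(view_weights) @ logits
    return int(np.argmax(scores)), scores


# Toy placeholders: in practice these would come from CLIP's encoders, with the
# text features computed from GPT-generated, 3D-specific class descriptions.
text_features = np.array([[1.0, 0.0, 0.0],   # class 0, e.g. "chair"
                          [0.0, 1.0, 0.0]])  # class 1, e.g. "lamp"
view_features = np.array([[0.9, 0.1, 0.0],
                          [0.8, 0.3, 0.1]])
label, scores = zero_shot_classify(view_features, text_features)
print(label, scores)
```

Because no parameters are learned, the prediction depends only on how well the projected views and the generated texts align in CLIP's embedding space, which is why the paper focuses on improving both inputs.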
In summary, the paper combines CLIP and GPT-3 to address key challenges in 3D open-world learning, chiefly the 2D-3D domain gap, and reports significant performance improvements across zero-shot 3D classification, part segmentation, and object detection.