Abstract: Large-scale pre-trained models have shown promising open-world performance for both vision and language tasks. However, their transferred capacity on 3D point clouds is still limited and only constrained to the classification task. In this paper, we first collaborate CLIP and GPT to be a unified 3D open-world learner, named as PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with the pre-trained language knowledge, PointCLIP V2 contains two key designs. For the visual end, we prompt CLIP via a shape projection module to generate more realistic depth maps, narrowing the domain gap between projected point clouds with natural images. For the textual end, we prompt the GPT model to generate 3D-specific text as the input of CLIP's textual encoder. Without any training in 3D domains, our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification. On top of that, V2 can be extended to few-shot 3D classification, zero-shot 3D part segmentation, and 3D object detection in a simple manner, demonstrating our generalization ability for unified 3D open-world learning.

What problem does this paper attempt to address?

The paper mainly addresses the following issues:

1. **Unified 3D Open World Learning Framework**: A powerful framework named PointCLIP V2 is proposed, which combines the strengths of CLIP (Contrastive Language-Image Pre-training) and GPT-3 (Generative Pre-trained Transformer 3) for zero-shot classification, segmentation, and detection tasks on 3D point cloud data. This enables the model to perform complex 3D open world understanding tasks without having seen any 3D training data.
2. **Mitigating the 2D-3D Domain Gap**: To better utilize CLIP's pre-trained knowledge in the image domain, the paper proposes two key designs: one is to generate more realistic depth maps through a shape projection module to prompt CLIP, reducing the domain gap between 3D point cloud projections and natural images; the other is to prompt GPT-3 to generate text inputs rich in 3D semantics, enhancing the capability of the CLIP text encoder.
3. **Improving Zero-Shot Classification Performance**: On multiple 3D datasets, PointCLIP V2 achieves significant performance improvements in zero-shot 3D classification tasks compared to the previous version PointCLIP, such as increasing accuracy by +42.90%, +40.44%, and +28.75% on the ModelNet10, ModelNet40, and ScanObjectNN datasets, respectively.
4. **Extending to Other 3D Open World Tasks**: Besides zero-shot classification, PointCLIP V2 can be extended to few-shot classification, zero-shot part segmentation, and zero-shot 3D object detection tasks with simple modifications, demonstrating its powerful capability in unified 3D open world learning.
5. **Overcoming Limitations of Existing Methods**: Compared to the previous PointCLIP, PointCLIP V2 not only improves classification accuracy but also achieves unified handling of 3D open world tasks for the first time, including zero-shot part segmentation and zero-shot 3D object detection.

In summary, the paper combines CLIP and GPT-3 to address these challenges in 3D open world learning: it reports large accuracy gains over PointCLIP on zero-shot 3D classification and extends the same approach to few-shot classification, zero-shot part segmentation, and zero-shot 3D object detection.
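
The zero-shot classification step described above can be illustrated with a minimal sketch. It assumes that image-encoder features for several projected depth maps of one shape and text-encoder features for each class prompt are already available; here both are replaced by synthetic NumPy arrays, so no CLIP or GPT model is actually called. The function name `zero_shot_classify`, the temperature value of 100, and the plain averaging over views are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_classify(view_feats, text_feats, temperature=100.0):
    """Hypothetical helper: classify one 3D shape from its projected views.

    view_feats: (V, D) features of V depth maps (stand-in for CLIP image features).
    text_feats: (K, D) features of K class prompts (stand-in for CLIP text features).
    Returns the predicted class index and a (K,) probability vector.
    """
    v = l2_normalize(np.asarray(view_feats, dtype=np.float64))
    t = l2_normalize(np.asarray(text_feats, dtype=np.float64))

    # Cosine similarity between every view and every class prompt: (V, K).
    logits_per_view = temperature * (v @ t.T)

    # Assumption: views are aggregated by a plain mean over the view axis.
    logits = logits_per_view.mean(axis=0)

    # Numerically stable softmax over classes.
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    class_names = ["airplane", "chair", "lamp"]

    # Synthetic stand-ins for text features of the three class prompts.
    text_feats = rng.normal(size=(3, 16))

    # Synthetic "views": noisy copies of the "chair" text feature.
    view_feats = text_feats[1] + 0.1 * rng.normal(size=(4, 16))

    pred, probs = zero_shot_classify(view_feats, text_feats)
    print(class_names[pred], probs.round(3))
```

In a real pipeline the two arrays would come from a frozen image encoder applied to the projected depth maps and a frozen text encoder applied to the generated 3D-specific prompts; no parameter is trained on 3D data in this setup.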

PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

PointCLIP: Point Cloud Understanding by CLIP

CLIP^2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training

CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition

CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP

DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

Exploiting GPT-4 Vision for Zero-shot Point Cloud Understanding

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training

Joint Representation Learning for Text and 3D Point Cloud

Robust 3D Point Cloud Recognition: Enhancing Robustness with GPT-4 and CLIP Integration

Point-to-Pixel Prompting for Point Cloud Analysis With Pre-Trained Image Models

MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition

EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder

Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation