Abstract: We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only shows better generalization compared to existing point cloud methods, but also reduces GPU requirements and training time. In addition, the model is modified with cross-view attention to leverage information across multiple frames of the object, which further boosts performance. Notably, our model is permutation invariant to the order of multi-view images while being pose-free. Compared to the current SOTA point cloud method, which requires 480 A100 hours to train 1 billion model parameters, we only require 57 A5000 hours and 87 million parameters. Multi-view images also provide more flexibility, including the ability to encode objects with a variable number of images, and performance scales when more views are used. In contrast, point cloud based methods require an entire scan or model of the object. We showcase this flexibility with benchmarks from images of real-world objects. Our model also achieves better performance in more fine-grained text-to-shape retrieval, demonstrating better text-and-shape alignment than point cloud based models.

What problem does this paper attempt to address?

The paper addresses how to learn 3D shape representations from multi-view images more efficiently and accurately, rather than relying on point cloud data. Specifically, the authors propose the Duoduo CLIP model, which aims to improve on existing 3D shape understanding methods in the following respects:

1. **Using multi-view images instead of point clouds**: Most existing methods represent 3D shapes as point clouds, but point clouds are limited in resolution and suffer from a domain gap with real-world images. Duoduo CLIP takes multi-view images as input and can therefore make better use of the priors in pre-trained 2D vision-language models (such as CLIP), improving generalization and efficiency.
2. **Reducing the demand for computing resources**: Compared with existing point cloud methods, Duoduo CLIP significantly reduces training time and GPU requirements. For example, the SOTA point cloud method requires 480 A100 GPU hours to train 1 billion parameters, whereas Duoduo CLIP needs only 57 A5000 GPU hours and 87 million parameters.
3. **Enhancing flexibility**: The multi-view image representation allows objects to be encoded with a variable number of views, and performance improves as the number of views increases. In addition, the model does not require camera poses for the input images and is invariant to their order, which further improves flexibility.
4. **Better text-to-shape retrieval**: Duoduo CLIP shows superior performance on fine-grained text-to-shape retrieval, indicating better text-and-shape alignment than point cloud based methods.

### Main contributions

- **More efficient training**: Using multi-view images reduces the demand for large-scale GPU resources while maintaining high performance.
- **Better generalization ability**: The model generalizes better to unseen shapes and, in particular, performs better on real-world objects than existing point cloud methods.
- **Higher flexibility**: The model supports a variable number of input views and adapts to different application scenarios, such as real-time robot applications.
- **Improved text-to-shape retrieval**: The model performs well on text-to-shape retrieval with fine-grained descriptions, showing better text-and-shape alignment.

### Summary

By making multi-view images the core of its 3D shape representation, Duoduo CLIP improves both the efficiency and the generalization ability of the model and provides a more flexible and effective solution for 3D shape understanding.
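
To make the flexibility claims concrete, the following is a minimal sketch of why an encoder built on multi-view images can accept any number of views in any order, and how text-to-shape retrieval then reduces to a cosine-similarity lookup. It is an illustration only, not the paper's implementation: mean pooling of per-view embeddings is an assumption chosen because it is a simple order-independent operation (the paper itself describes cross-view attention), and the function names `pool_views` and `retrieve` as well as the toy 3-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(math.fsum(x * x for x in vec))
    return [x / norm for x in vec]


def pool_views(view_embeddings):
    """Combine per-view embeddings into a single unit-length shape embedding.

    Assumption for this sketch: views are combined by mean pooling, which
    works for any number of views and ignores their order.
    """
    num_views = len(view_embeddings)
    dim = len(view_embeddings[0])
    # math.fsum returns a correctly rounded sum, so the result does not
    # depend on the order in which the views are listed.
    mean = [math.fsum(v[i] for v in view_embeddings) / num_views for i in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return math.fsum(x * y for x, y in zip(a, b))


def retrieve(text_embedding, shape_embeddings):
    """Return the name of the shape whose embedding best matches the text."""
    query = l2_normalize(text_embedding)
    return max(shape_embeddings, key=lambda name: cosine(query, shape_embeddings[name]))


# Toy per-view embeddings (hypothetical values standing in for image-encoder outputs).
chair_views = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]  # three views
lamp_views = [[0.1, 0.9, 0.2], [0.0, 1.0, 0.1]]                    # only two views

shapes = {"chair": pool_views(chair_views), "lamp": pool_views(lamp_views)}

# Reordering the views leaves the pooled embedding unchanged.
shuffled = pool_views(list(reversed(chair_views)))
same = all(math.isclose(a, b) for a, b in zip(shapes["chair"], shuffled))

# A toy text embedding standing in for a text-encoder output for "a chair".
best = retrieve([1.0, 0.1, 0.0], shapes)
print(same, best)
```

In this sketch the two shapes are encoded from different numbers of views yet land in the same embedding space, which is the property that lets a text query be compared against all of them with a single similarity measure.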

Duoduo CLIP: Efficient 3D Understanding with Multi-View Images
