Duoduo CLIP: Efficient 3D Understanding with Multi-View Images

Han-Hung Lee, Yiming Zhang, Angel X. Chang
2024-10-18
Abstract: We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only shows better generalization compared to existing point cloud methods, but also reduces GPU requirements and training time. In addition, the model is modified with cross-view attention to leverage information across multiple frames of the object, which further boosts performance. Notably, our model is permutation invariant to the order of multi-view images while being pose-free. Compared to the current SOTA point cloud method that requires 480 A100 hours to train 1 billion model parameters, we only require 57 A5000 hours and 87 million parameters. Multi-view images also provide more flexibility, including being able to encode objects with a variable number of images, and performance scales when more views are used. In contrast, point cloud based methods require an entire scan or model of the object. We showcase this flexibility with benchmarks from images of real-world objects. Our model also achieves better performance in more fine-grained text to shape retrieval, demonstrating better text-and-shape alignment than point cloud based models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is how to learn 3D shape representations from multi-view images more efficiently and accurately, rather than relying on point cloud data. Specifically, the authors propose the Duoduo CLIP model, which aims to improve on existing 3D shape understanding methods in the following respects:

1. **Using multi-view images instead of point clouds**: Most existing methods represent 3D shapes as point clouds, which are limited in resolution and suffer from a domain gap with real-world images. Duoduo CLIP takes multi-view images as input and can therefore make better use of the prior knowledge in pre-trained 2D vision-language models (such as CLIP), improving generalization and efficiency.
2. **Reducing the demand for computing resources**: Compared with existing point cloud methods, Duoduo CLIP significantly reduces training time and GPU requirements. For example, the SOTA point cloud method requires 480 A100 GPU hours to train 1 billion parameters, whereas Duoduo CLIP needs only 57 A5000 GPU hours and 87 million parameters.
3. **Enhancing flexibility**: The multi-view image representation allows objects to be encoded with a variable number of views, and performance improves as the number of views increases. In addition, the model is pose-free and permutation invariant to the order of the input views, which further improves flexibility.
4. **Better text-to-shape retrieval**: Duoduo CLIP performs better in fine-grained text-to-shape retrieval tasks, indicating stronger text-and-shape alignment than point cloud based methods.

### Main contributions

- **More efficient training**: Using multi-view images reduces the demand for large-scale GPU resources while maintaining high performance.
- **Better generalization ability**: The model generalizes better to unseen shapes and, in particular, outperforms existing point cloud methods on real-world objects.
- **Higher flexibility**: The model supports a variable number of input views and adapts to different application scenarios, such as real-time robot applications.
- **Improved text-to-shape retrieval**: The model performs well in text-to-shape retrieval with fine-grained descriptions, showing better text-and-shape alignment.

### Summary

By making multi-view images the core of its 3D shape representation, Duoduo CLIP improves the efficiency and generalization ability of the model and provides a more flexible and effective solution for 3D shape understanding.
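
The flexibility claims above (a variable number of views, invariance to view order, and text-to-shape retrieval) can be illustrated with a small sketch. It is not the authors' implementation: the paper fuses views with cross-view attention inside a CLIP image encoder, whereas this sketch assumes per-view embeddings are already available and simply mean-pools them, which is one simple way to obtain an order-invariant shape embedding. The function names `pool_views` and `retrieve` and the toy 3-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(math.fsum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_views(view_embeddings):
    """Average per-view embeddings into one unit-length shape embedding.

    ASSUMPTION: mean pooling stands in for the paper's cross-view
    attention. It accepts any number of views and ignores their order.
    """
    if not view_embeddings:
        raise ValueError("need at least one view embedding")
    count = len(view_embeddings)
    # zip(*views) groups the i-th coordinate of every view together.
    mean = [math.fsum(coords) / count for coords in zip(*view_embeddings)]
    return l2_normalize(mean)


def retrieve(text_embedding, shape_embeddings):
    """Rank shape names by cosine similarity to a text embedding."""
    query = l2_normalize(text_embedding)
    scores = {
        name: math.fsum(q * s for q, s in zip(query, emb))
        for name, emb in shape_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (hypothetical): two views of a chair, three of a table.
chair_views = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]
table_views = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]]
shapes = {"chair": pool_views(chair_views), "table": pool_views(table_views)}

# A text embedding that lies close to the chair views ranks "chair" first.
ranking = retrieve([1.0, 0.2, 0.0], shapes)
print(ranking)
```

Because the pooled embedding is an average, reversing or shuffling `chair_views` yields the same shape embedding, and objects with different view counts land in the same embedding space. In a real pipeline the toy vectors would be replaced by image and text features from a CLIP-style encoder.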