Can CLIP help CLIP in learning 3D?

Cristian Sbrolli, Matteo Matteucci
2024-06-04
Abstract: In this study, we explore an alternative approach to enhance contrastive text-image-3D alignment in the absence of textual descriptions for 3D objects. We introduce two unsupervised methods, $I2I$ and $(I2L)^2$, which leverage CLIP knowledge about textual and 2D data to compute the neural perceived similarity between two 3D samples. We employ the proposed methods to mine 3D hard negatives, establishing a multimodal contrastive pipeline with hard negative weighting via a custom loss function. We train on different configurations of the proposed hard negative mining approach, and we evaluate the accuracy of our models in 3D classification and on the cross-modal retrieval benchmark, testing image-to-shape and shape-to-image retrieval. Results demonstrate that our approach, even without explicit text alignment, achieves comparable or superior performance on zero-shot and standard 3D classification, while significantly improving both image-to-shape and shape-to-image retrieval compared to previous methods.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper aims to improve the contrastive alignment of text, 2D images, and 3D data when no text descriptions of the 3D objects are available. Specifically, the authors propose two unsupervised methods, I2I and (I2L)², that use the CLIP model's knowledge of text and 2D data to compute the neural perceived similarity between 3D samples. These similarities are then used to mine hard negatives that improve contrastive training.

#### Main problem background:

1. **Data scarcity**: Existing 3D datasets usually lack high-quality text descriptions, making it difficult to apply the CLIP model directly for multimodal alignment.
2. **Limitations of existing methods**: Previous methods attempted to align 3D objects by generating text descriptions or by using template text prompts, but the resulting descriptions often lack descriptiveness and distinctiveness and are prone to introducing biases.

#### Solutions:

- **I2I method**: Uses only object views and the CLIP image encoder to compute 3D similarity.
- **(I2L)² method**: Combines images with detailed category prompts generated by large language models (LLMs) to address the color, texture, and material information that image-based methods miss.

#### Specific goals:

- Propose a method that achieves 3D classification and cross-modal retrieval without explicit text alignment.
- Enhance contrastive training by mining hard negatives, thereby improving model performance on zero-shot and standard 3D classification tasks.
- Significantly outperform existing methods on cross-modal retrieval tasks, especially image-to-shape and shape-to-image retrieval.

### Summary:

The main contribution of this paper is a new framework that uses the visual and textual knowledge of the CLIP model to enhance contrastive learning on 3D data. In particular, in the absence of text descriptions, it improves 3D classification and cross-modal retrieval performance by mining hard negatives.
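The I2I idea described above (scoring the similarity of two 3D objects from the CLIP image embeddings of their rendered views, then treating the most similar other objects as hard negatives) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the embeddings are small hand-written stand-ins for real CLIP image features, and mean pooling over views followed by cosine similarity is an assumption of this sketch, since the summary does not specify how view embeddings are aggregated.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def object_embedding(view_embeddings):
    """Mean-pool the unit-normalized view embeddings of one object.

    Mean pooling is an assumption of this sketch, not a detail from the paper.
    """
    views = [l2_normalize(v) for v in view_embeddings]
    dim = len(views[0])
    mean = [sum(v[i] for v in views) / len(views) for i in range(dim)]
    return l2_normalize(mean)


def similarity_matrix(objects):
    """Cosine similarity between the pooled embeddings of every pair of objects."""
    embs = [object_embedding(views) for views in objects]
    return [[sum(a * b for a, b in zip(e1, e2)) for e2 in embs] for e1 in embs]


def mine_hard_negatives(sim, k):
    """For each object, return the indices of the k most similar other objects."""
    hard = []
    for i, row in enumerate(sim):
        candidates = [j for j in range(len(row)) if j != i]
        candidates.sort(key=lambda j: row[j], reverse=True)
        hard.append(candidates[:k])
    return hard


# Toy data: 3 objects, 2 views each, with 3-dimensional stand-ins for
# CLIP image features (real CLIP embeddings have hundreds of dimensions).
toy_objects = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],  # object 0
    [[0.8, 0.2, 0.0], [1.0, 0.1, 0.0]],  # object 1, visually close to object 0
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],  # object 2, visually different
]

sim = similarity_matrix(toy_objects)
hard_negatives = mine_hard_negatives(sim, k=1)
```

In this toy setup, objects 0 and 1 point in nearly the same direction, so each becomes the other's hardest negative. In a real pipeline the view embeddings would come from a CLIP image encoder applied to renderings of each shape, and the mined negatives would feed a weighted contrastive loss, whose exact form is not given in this summary.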