Abstract: In this study, we explore an alternative approach to enhance contrastive text-image-3D alignment in the absence of textual descriptions for 3D objects. We introduce two unsupervised methods, $I2I$ and $(I2L)^2$, which leverage CLIP knowledge about textual and 2D data to compute the neural perceived similarity between two 3D samples. We employ the proposed methods to mine 3D hard negatives, establishing a multimodal contrastive pipeline with hard negative weighting via a custom loss function. We train on different configurations of the proposed hard negative mining approach, and we evaluate the accuracy of our models in 3D classification and on the cross-modal retrieval benchmark, testing image-to-shape and shape-to-image retrieval. Results demonstrate that our approach, even without explicit text alignment, achieves comparable or superior performance on zero-shot and standard 3D classification, while significantly improving both image-to-shape and shape-to-image retrieval compared to previous methods.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper aims to solve the problem of enhancing the alignment of text, 2D images and 3D data in contrastive learning in the absence of 3D object text descriptions. Specifically, the authors propose two unsupervised methods (I2I and (I2L)²), use the CLIP model's knowledge of text and 2D data to calculate the neural-perceptual similarity between 3D samples, and mine hard negatives through these similarities to improve contrastive training.

#### Main problem background:

1. **Data scarcity**: Existing 3D datasets usually lack high-quality text descriptions, making it difficult to directly apply the CLIP model for multi-modal alignment.
2. **Limitations of existing methods**: Previous methods attempted to align 3D objects by generating text descriptions or using template text prompts, but the text descriptions generated by these methods are often lacking in descriptiveness and distinctiveness, and are prone to introducing biases.

#### Solutions:

- **I2I method**: Uses only object views and the CLIP image encoder to calculate 3D similarity.
- **(I2L)² method**: Combines images and detailed category prompts generated by large language models (LLMs) to solve the problem of missing color, texture and material information in image-based methods.

#### Specific goals:

- Propose a method that can achieve 3D classification and cross-modal retrieval without explicit text alignment.
- Enhance contrastive training by mining hard negatives, thereby improving the performance of the model in zero-shot and standard 3D classification tasks.
- Significantly outperform existing methods in cross-modal retrieval tasks, especially in image-to-shape and shape-to-image retrieval.

### Summary:

The main contribution of this paper is to propose a new framework that uses the visual and text knowledge of the CLIP model to enhance contrastive learning of 3D data. In particular, in the absence of text descriptions, it improves the performance of 3D classification and cross-modal retrieval by mining hard negatives.
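
To make the idea of view-based similarity and hard negative weighting more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the paper's custom loss, so the sketch swaps in a generic weighted InfoNCE loss. The function names (`view_similarity`, `hard_negative_weights`, `weighted_info_nce`) and the parameters `k`, `boost`, and `temperature` are hypothetical, and the per-view embeddings are assumed to have been precomputed with a CLIP image encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def view_similarity(view_emb):
    """Image-only similarity between 3D objects (in the spirit of I2I).

    view_emb: array of shape (N, V, D) holding D-dim embeddings of V
    rendered views for each of N objects. Assumption: these come from a
    CLIP image encoder that was run beforehand.
    Returns an (N, N) cosine-similarity matrix between view-averaged
    object embeddings.
    """
    per_object = l2_normalize(l2_normalize(view_emb).mean(axis=1))
    return per_object @ per_object.T


def hard_negative_weights(sim, k=2, boost=2.0):
    """Give the k most similar *other* samples in each row a larger weight.

    Returns an (N, N) matrix: `boost` for mined hard negatives, 1 elsewhere
    (including the diagonal, which holds the positives).
    """
    n = sim.shape[0]
    assert 0 < k < n, "k must leave out at least the sample itself"
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)          # never pick the sample itself
    top_k = np.argsort(-masked, axis=1)[:, :k]
    weights = np.ones((n, n))
    np.put_along_axis(weights, top_k, boost, axis=1)
    return weights


def weighted_info_nce(shape_emb, image_emb, weights, temperature=0.07):
    """InfoNCE over shape-image pairs with per-negative weights.

    Each term exp(logit_ij) in the denominator is multiplied by weights[i, j],
    implemented by adding log(weights) to the logits.
    """
    z_s = l2_normalize(shape_emb)
    z_i = l2_normalize(image_emb)
    logits = z_s @ z_i.T / temperature + np.log(weights)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = rng.normal(size=(8, 4, 32))     # random stand-in for CLIP view embeddings
    shapes = rng.normal(size=(8, 32))       # random stand-in for 3D encoder outputs
    images = views.mean(axis=1)             # one image embedding per object
    sim = view_similarity(views)
    w = hard_negative_weights(sim, k=2, boost=2.0)
    print("weighted loss:", weighted_info_nce(shapes, images, w))
```

Because every weight is at least 1 and the positives on the diagonal keep weight 1, up-weighting the mined negatives can only enlarge the softmax denominator, so samples that CLIP perceives as similar contribute more to the loss than ordinary negatives.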

Can CLIP help CLIP in learning 3D?

Debiased Graph Contrastive Learning.

CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Adaptive CLIP for open-domain 3D model retrieval

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

Finetuning CLIP to Reason about Pairwise Differences

Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

The Double-Ellipsoid Geometry of CLIP

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

CLIP Can Understand Depth

PointCLIP: Point Cloud Understanding by CLIP

Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

When Text and Images Don't Mix: Bias-Correcting Language-Image Similarity Scores for Anomaly Detection

CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision