Abstract: To make 3D model retrieval more practical, we use a single real image as the query for retrieving 3D models. However, 2D images and 3D models differ significantly in lighting conditions, textures, and backgrounds, which poses a great challenge for accurate retrieval. Existing work on 3D model retrieval mainly focuses on the closed-domain setting, whereas the open-domain setting, in which the category relationship between the query image and the 3D model is unknown, better matches the needs of real scenarios. CLIP shows significant promise in comprehending open-world visual concepts, enabling effective zero-shot image recognition. Building on this pre-trained multimodal model, we introduce Adaptive Open-domain Semantic Nearest-neighbor Contrast (AOSNC), a method for learning and aligning three modalities: text, images, and 3D models. To address inconsistent cross-domain categories and the difficulty of associating samples in the open domain, we construct a cross-modal bridge using CLIP, which uses textual features to bridge the gap between 2D images and 3D model views. Additionally, we design an adaptive network layer to address the limitations of the pre-trained model on 3D model views and to enhance cross-modal alignment. We further propose a mutual nearest-neighbor semantic alignment loss to address the challenge of aligning features from disparate modalities (text, images, and 3D models). This loss enhances cross-modal learning by associating and distinguishing features, improving retrieval accuracy. We conducted comprehensive experiments on the image-based 3D model retrieval dataset MI3DOR and the cross-domain 3D model retrieval dataset NTU-PSB to validate the superiority of the proposed method. Our results show significant improvements in several evaluation metrics, underscoring the efficacy of our method in improving cross-modal feature alignment and retrieval performance.
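
The abstract does not specify how mutual nearest neighbors are selected or how the loss is computed, so the following is only a minimal sketch of the general idea of mutual nearest-neighbor pairing between two feature sets, not the AOSNC loss itself. The function names (`cosine`, `mutual_nearest_pairs`) and the toy 2-dimensional features are hypothetical and chosen purely for illustration; in practice the features would presumably come from CLIP encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mutual_nearest_pairs(image_feats, view_feats):
    """Return (i, j) pairs where image i and view j are each other's nearest neighbor.

    Assumption: similarity is plain cosine similarity; the paper may use a
    different measure or a text-guided variant.
    """
    sims = [[cosine(u, v) for v in view_feats] for u in image_feats]
    n_img, n_view = len(image_feats), len(view_feats)
    # Most similar view for every image.
    best_view = [max(range(n_view), key=lambda j: sims[i][j]) for i in range(n_img)]
    # Most similar image for every view.
    best_image = [max(range(n_img), key=lambda i: sims[i][j]) for j in range(n_view)]
    # Keep a pair only when the choice is reciprocal.
    return [(i, j) for i, j in enumerate(best_view) if best_image[j] == i]

# Toy features, made up for illustration only.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
view_feats = [[0.9, 0.1], [0.1, 0.9]]
pairs = mutual_nearest_pairs(image_feats, view_feats)
print(pairs)
```

In this toy case the third image prefers the second view, but that view prefers the second image, so the pair is not reciprocal and is dropped. Pairs that survive this filter could then serve as positives in a contrastive objective, which is one plausible reading of the loss described above.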

InterCLIP: Adapting CLIP To Interactive Image Retrieval with Triplet Similarity

mCLIP: Multilingual CLIP via Cross-lingual Transfer

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

CLIP-Branches: Interactive Fine-Tuning for Text-Image Retrieval

Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

Adaptive CLIP for open-domain 3D model retrieval

Animating Images to Transfer CLIP for Video-Text Retrieval

Image–Text Matching Model Based on CLIP Bimodal Encoding

Unveiling the Power of CLIP in Unsupervised Visible-Infrared Person Re-Identification

Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning

CLIP-Based Composed Image Retrieval with Comprehensive Fusion and Data Augmentation

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Turning a CLIP modal into image-text matching

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

DiffCLIP: Few-shot Language-driven Multimodal Classifier

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want