Abstract: To make 3D model retrieval more practical, we use a single real image as the query for retrieving 3D models. However, the significant differences between 2D images and 3D models in lighting conditions, textures, and backgrounds pose a great challenge for accurate retrieval. Existing work on 3D model retrieval mainly focuses on the closed-domain setting, whereas the open-domain setting, in which the category relationship between the query image and the 3D models is unknown, better matches the needs of real scenarios. CLIP shows significant promise in comprehending open-world visual concepts, facilitating effective zero-shot image recognition. Building on this pre-trained multimodal model, we introduce Adaptive Open-domain Semantic Nearest-neighbor Contrast (AOSNC), a method for learning and aligning text, image, and 3D model representations. To address inconsistent cross-domain categories and the difficulty of associating samples in the open-domain setting, we construct a cross-modal bridge with CLIP, using textual features to bridge the gap between 2D images and 3D model views. Additionally, we design an adaptive network layer to address the limitations of the pre-trained model on 3D model views and to enhance cross-modal alignment. We also propose a mutual nearest-neighbor semantic alignment loss to address the challenge of aligning features from disparate modalities (text, images, and 3D models). This loss enhances cross-modal learning by effectively associating and distinguishing features, improving retrieval accuracy. We conducted comprehensive experiments on the image-based 3D model retrieval dataset MI3DOR and the cross-domain 3D model retrieval dataset NTU-PSB to validate the proposed method. The results show significant improvements on several evaluation metrics, underscoring the efficacy of our method in improving cross-modal feature alignment and retrieval performance.
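
The abstract does not specify how the mutual nearest-neighbor semantic alignment loss is computed. As a purely illustrative sketch of the general mutual nearest-neighbor idea it names, and not the AOSNC implementation, the snippet below pairs an image feature with a 3D-view feature only when each is the other's nearest neighbor under cosine similarity. The function names and the toy 2-D feature values are assumptions made for this example; in practice the features would presumably come from an encoder such as CLIP, and the selected pairs could serve as positives in a contrastive objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(queries, gallery):
    """For each query vector, the index of its most similar gallery vector."""
    return [
        max(range(len(gallery)), key=lambda j: cosine(q, gallery[j]))
        for q in queries
    ]


def mutual_nearest_pairs(image_feats, view_feats):
    """Return (image_idx, view_idx) pairs that select each other as nearest."""
    img_to_view = nearest(image_feats, view_feats)
    view_to_img = nearest(view_feats, image_feats)
    return [(i, j) for i, j in enumerate(img_to_view) if view_to_img[j] == i]


# Toy 2-D features standing in for real embeddings (made-up values).
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
view_feats = [[0.9, 0.1], [0.1, 0.9]]

pairs = mutual_nearest_pairs(image_feats, view_feats)
print(pairs)
```

In this toy case the third image is closest to the first view, but that view prefers the first image, so the third image is left unpaired. Filtering out such one-sided matches is the usual motivation for a mutual criterion when no category labels are shared across domains.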

Unsupervised self-training correction learning for 2D image-based 3D model retrieval

A Semantic Labeling Strategy to Reject Unknown Objects in Large Scale 3d Point Clouds

Self-supervised Image-based 3D Model Retrieval

CLN: Cross-Domain Learning Network for 2D Image-Based 3D Shape Retrieval

Adaptive CLIP for open-domain 3D model retrieval

Learning Transferable and Discriminative Representations for 2D Image-Based 3D Model Retrieval

ST3D++: Denoised Self-Training for Unsupervised Domain Adaptation on 3D Object Detection

MA-ST3D: Motion Associated Self-Training for Unsupervised Domain Adaptation on 3D Object Detection

Augment and Criticize: Exploring Informative Samples for Semi-Supervised Monocular 3D Object Detection

3D Self-Supervised Methods for Medical Imaging

Locating Target Regions for Image Retrieval in an Unsupervised Manner

STAL3D: Unsupervised Domain Adaptation for 3D Object Detection via Collaborating Self-Training and Adversarial Learning

Self-supervised learning via inter-modal reconstruction and feature projection networks for label-efficient 3D-to-2D segmentation

Bayesian Self-Training for Semi-Supervised 3D Segmentation

Single Image 3D Shape Retrieval Via Cross-Modal Instance and Category Contrastive Learning

Domain-Adversarial-Guided Siamese Network for Unsupervised Cross-Domain 3-D Object Retrieval

Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection

Multi-class center dynamic contrastive learning for unsupervised domain adaptation person re-identification

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

SDCL: Students Discrepancy-Informed Correction Learning for Semi-supervised Medical Image Segmentation

ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection