Abstract: The traditional 3D object retrieval (3DOR) task operates under the closed-set setting, which assumes that every object category encountered at retrieval time was seen during training. Methods developed under this setting tend to learn only to discriminate among the seen categories rather than to learn a generalized 3D object embedding. As a result, 3DOR remains a challenging and open problem in real-world applications, where many unseen categories exist. In this paper, we first introduce the open-set 3DOR task to extend the applicability of the traditional 3DOR task. We then propose the Hypergraph-Based Multi-Modal Representation (HGM²R) framework to learn 3D object embeddings from multi-modal representations under the open-set setting. The framework is composed of two modules: the Multi-Modal 3D Object Embedding (MM3DOE) module and the Structure-Aware and Invariant Knowledge Learning (SAIKL) module. By exploiting the collaborative information among modalities derived from the same 3D object, the MM3DOE module bridges the differences between modality representations and generates unified 3D object embeddings. The SAIKL module then uses a constructed hypergraph structure to model the high-order correlation among 3D objects from both seen and unseen categories. The SAIKL module also includes a memory bank that stores typical representations of 3D objects. By aligning with the memory anchors in this bank, the embeddings integrate invariant knowledge and generalize better to unseen categories. We formally prove that hypergraph modeling has a better representative capability for data correlation than graph modeling. We generate four multi-modal datasets for the open-set 3DOR task, i.e., OS-ESB-core, OS-NTU-core, OS-MN40-core, and OS-ABO-core, in which each 3D object has three modality representations: multi-view, point cloud, and voxel. Experiments on these four datasets show that the proposed method significantly outperforms existing methods. In particular, it outperforms the state of the art by 12.12% and 12.88% in terms of mAP on the OS-MN40-core and OS-ABO-core datasets, respectively. Results and visualizations demonstrate that the proposed method effectively extracts generalized 3D object embeddings for the open-set 3DOR task and achieves satisfactory performance.
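As a rough illustration of what modeling high-order correlation with a hypergraph can look like, the sketch below builds a k-nearest-neighbor hypergraph over object embeddings and applies one parameter-free smoothing step. The abstract does not specify how the hypergraph is constructed or how it is used inside the SAIKL module, so the k-NN construction, the averaging rule, and the function names `knn_hypergraph` and `smooth` are all assumptions made for this example and not the authors' implementation.

```python
import math


def knn_hypergraph(features, k):
    """Build an incidence matrix H of shape (n_vertices, n_hyperedges).

    Assumed construction: hyperedge j links vertex j with its k nearest
    neighbours, so one hyperedge connects k + 1 objects at once. A plain
    graph edge can only ever connect two.
    """
    n = len(features)
    H = [[0.0] * n for _ in range(n)]
    for j in range(n):
        others = [i for i in range(n) if i != j]
        # Rank the other vertices by Euclidean distance to centroid vertex j.
        others.sort(key=lambda i: math.dist(features[i], features[j]))
        for i in [j] + others[:k]:
            H[i][j] = 1.0
    return H


def smooth(features, H):
    """One smoothing step: vertices -> hyperedges -> vertices (both averages)."""
    n, m, d = len(H), len(H[0]), len(features[0])
    edge_deg = [sum(H[i][j] for i in range(n)) for j in range(m)]
    vert_deg = [sum(H[i]) for i in range(n)]
    # Each hyperedge takes the mean feature of its member vertices.
    edge_feat = [
        [sum(H[i][j] * features[i][c] for i in range(n)) / edge_deg[j]
         for c in range(d)]
        for j in range(m)
    ]
    # Each vertex takes the mean feature of the hyperedges it belongs to.
    return [
        [sum(H[i][j] * edge_feat[j][c] for j in range(m)) / vert_deg[i]
         for c in range(d)]
        for i in range(n)
    ]


# Toy embeddings: two well-separated pairs of objects (hypothetical data).
feats = [[0, 0], [0, 1], [10, 10], [10, 11]]
H = knn_hypergraph(feats, k=1)
smoothed = smooth(feats, H)
```

In this toy case each object ends up sharing hyperedges only with its nearby partner, so smoothing pulls the two members of a pair toward their common mean while leaving the two pairs far apart. A learned model would typically insert trainable weights between the two averaging steps.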

Structure-Aware Residual-Center Representation for Self-Supervised Open-Set 3D Cross-Modal Retrieval

Hypergraph-Based Multi-Modal Representation for Open-Set 3D Object Retrieval

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval

CAMVR: Context-Adaptive Multi-View Representation Learning for Dense Retrieval

Adversarial Cross-Modal Retrieval

Universal unsupervised cross-domain 3D shape retrieval

Deep Supervised Cross-Modal Retrieval

Semantic Feature Learning for Universal Unsupervised Cross-Domain Retrieval

Self-supervised Image-based 3D Model Retrieval

Federated learning for supervised cross-modal retrieval

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and Multi-View for 3D Object Retrieval

Category-Oriented Representation Learning for Image to Multi-Modal Retrieval

Learning Discriminative Representations for Semantic Cross Media Retrieval

Multi-Modal Coreference Resolution with the Correlation between Space Structures

Adaptive CLIP for open-domain 3D model retrieval

A Unified Framework for Cross-Modality 3D Model Retrieval

Self-supervised Correlation Learning for Cross-Modal Retrieval

COM3D: Leveraging Cross-View Correspondence and Cross-Modal Mining for 3D Retrieval