Abstract: In this work, we tackle the problem of single image-based 3D shape retrieval (IBSR), where we seek the best-matching shape for a given 2D image in a shape repository. Most existing works learn to embed 2D images and 3D shapes into a common feature space and perform metric learning with a triplet loss. Inspired by the recent success of contrastive learning in self-supervised representation learning, we propose a novel IBSR pipeline built on contrastive learning. Adapting such cross-modal contrastive learning between 2D images and 3D shapes to IBSR is non-trivial: contrastive learning requires very strong data augmentation when constructing positive pairs in order to learn feature invariance, whereas traditional metric learning has no such requirement. Moreover, object shape and appearance are entangled in 2D query images, which makes the task harder than contrasting single-modal data. To mitigate these challenges, we propose to represent each 3D shape by multi-view grayscale images rendered from it. We then introduce a strong data augmentation technique based on color transfer, which can change the appearance of the query image significantly yet naturally, satisfying the augmentation needs of contrastive learning. Finally, in addition to the classic instance-level contrastive loss, we incorporate a novel category-level contrastive loss that helps distinguish similar objects from different categories. Experiments show that our approach achieves the best performance on all three popular IBSR benchmarks, Pix3D, Stanford Cars, and CompCars, outperforming the previous state of the art by 4% to 15% in retrieval accuracy.
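The instance-level and category-level contrastive losses described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' formulation: it assumes an InfoNCE-style loss with cosine similarity and a temperature `tau`, a batch in which image `i` is paired with shape `i`, and a category-level term that treats every shape sharing the query's category label as a positive (in the style of supervised contrastive learning). The abstract gives no formulas, so the function names, the temperature value, and any weighting between the two terms are assumptions.

```python
import math
from typing import List, Sequence

# Hypothetical sketch: embeddings are plain lists of floats; a real pipeline
# would use batched tensors produced by an image encoder and a shape encoder.
Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def _cosine_matrix(a: List[Vector], b: List[Vector]) -> List[List[float]]:
    """sims[i][j] = cosine similarity between a[i] and b[j]."""
    an = [_normalize(v) for v in a]
    bn = [_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in bn] for u in an]


def _log_softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def instance_contrastive_loss(images: List[Vector], shapes: List[Vector],
                              tau: float = 0.1) -> float:
    """InfoNCE: image i's only positive is shape i; all other shapes are negatives."""
    sims = _cosine_matrix(images, shapes)
    total = 0.0
    for i, row in enumerate(sims):
        log_probs = _log_softmax([s / tau for s in row])
        total -= log_probs[i]
    return total / len(images)


def category_contrastive_loss(images: List[Vector], shapes: List[Vector],
                              labels: List[int], tau: float = 0.1) -> float:
    """Assumed category-level variant: every shape whose label matches image i's
    label is a positive (labels[j] is the category of pair j); the negative
    log-probabilities of the positives are averaged."""
    sims = _cosine_matrix(images, shapes)
    total = 0.0
    for i, row in enumerate(sims):
        log_probs = _log_softmax([s / tau for s in row])
        positives = [j for j, c in enumerate(labels) if c == labels[i]]  # always contains i
        total -= sum(log_probs[j] for j in positives) / len(positives)
    return total / len(images)
```

A combined objective might then be `instance_contrastive_loss(...) + lam * category_contrastive_loss(...)`, where `lam` is a hypothetical weighting hyperparameter. Note that when every label in a batch is distinct, this category term reduces exactly to the instance term, so under this assumed form it only adds signal when a batch contains several objects of the same category.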
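One common way to realize color-transfer augmentation is Reinhard-style statistics matching: shift and scale each color channel of the query so that its mean and standard deviation match those of a randomly drawn reference image. The sketch below is a simplified assumption, not the paper's exact procedure: it works directly in RGB (Reinhard et al. operate in the decorrelated lαβ space), represents an image as a flat list of RGB tuples in the 0 to 255 range, and the helper names and the reference pool are hypothetical.

```python
import math
import random
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # hypothetical: RGB values in [0, 255]


def channel_stats(img: List[Pixel]) -> List[Tuple[float, float]]:
    """Per-channel (mean, population standard deviation)."""
    stats = []
    for c in range(3):
        vals = [p[c] for p in img]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats.append((mean, std))
    return stats


def color_transfer(source: List[Pixel], reference: List[Pixel],
                   eps: float = 1e-6) -> List[Pixel]:
    """Match each channel's mean/std of `source` to `reference` (simplified, RGB space)."""
    src, ref = channel_stats(source), channel_stats(reference)
    out = []
    for p in source:
        q = []
        for c in range(3):
            (ms, ss), (mr, sr) = src[c], ref[c]
            v = (p[c] - ms) * (sr / (ss + eps)) + mr  # affine, monotone per channel
            q.append(min(255.0, max(0.0, v)))         # clamp to the valid range
        out.append((q[0], q[1], q[2]))
    return out


def augment_query(query: List[Pixel], reference_pool: List[List[Pixel]],
                  rng: random.Random) -> List[Pixel]:
    """Hypothetical positive-pair augmentation: recolor the query with a random reference."""
    return color_transfer(query, rng.choice(reference_pool))
```

Because the per-channel map is affine with a positive scale, it keeps the spatial layout of edges and shading (the cues that carry object shape) while replacing the color palette, which is roughly why such a transform can act as a strong yet natural augmentation for contrastive positive pairs.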
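The shape-side representation and the retrieval step can be sketched as below. The abstract only states that shapes are represented by multi-view grayscale renders and that retrieval returns the best-matching shape; the number of views, the evenly spaced azimuth layout, the ITU-R BT.601 luma weights used for grayscale conversion, element-wise max-pooling across views (as in MVCNN), and cosine-similarity nearest-neighbor search are all assumptions chosen for illustration. The per-view CNN encoder is omitted, so view features are passed in directly, and every function name here is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[float, float, float]


def camera_azimuths(num_views: int = 12) -> List[float]:
    """Hypothetical camera layout: azimuths (degrees) evenly spaced around the object."""
    return [360.0 * k / num_views for k in range(num_views)]


def to_grayscale(view: List[Pixel]) -> List[float]:
    """Convert a rendered RGB view to grayscale with ITU-R BT.601 luma weights."""
    return [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in view]


def pool_views(view_features: List[Sequence[float]]) -> List[float]:
    """Fuse per-view feature vectors into one shape embedding by element-wise max."""
    return [max(values) for values in zip(*view_features)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def retrieve(query: Sequence[float], repository: Dict[str, Sequence[float]]) -> str:
    """Return the id of the repository shape most similar to the query image embedding."""
    return max(repository, key=lambda shape_id: cosine(query, repository[shape_id]))


# Usage with made-up two-dimensional per-view features for two shapes.
repository = {
    "chair_01": pool_views([[0.9, 0.1], [0.7, 0.2]]),
    "table_07": pool_views([[0.1, 0.8], [0.2, 0.9]]),
}
print(retrieve([1.0, 0.1], repository))  # chair_01
```

Dropping color on the shape side means the shape embedding can only depend on geometry and shading, which matches the abstract's motivation that appearance and shape are entangled in the query image but should not be in the shape representation.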

Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform

RISAS: A Novel Rotation, Illumination, Scale Invariant Appearance and Shape Feature

Self-supervised Learning of Rotation-invariant 3D Point Set Features using Transformer and its Self-distillation

Stable and Consistent Prediction of 3D Characteristic Orientation via Invariant Residual Learning

Rethinking Local-to-global Representation Learning for Rotation-Invariant Point Cloud Analysis

Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis

Equivariant Local Reference Frames for Unsupervised Non-rigid Point Cloud Shape Correspondence

Rethinking Rotation Invariance with Point Cloud Registration

Rotation-Invariant Transformer for Point Cloud Matching

Single Image 3D Shape Retrieval Via Cross-Modal Instance and Category Contrastive Learning

A Fast and Robust Rotation Search and Point Cloud Registration Method for 2D Stitching and 3D Object Localization

Rotation-Invariant Local-to-Global Representation Learning for 3D Point Cloud

SDF-SRN: Learning Signed Distance 3D Object Reconstruction from Static Images

Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching

Self-supervised Learning of Implicit Shape Representation with Dense Correspondence for Deformable Objects

Evaluating 3D Shape Analysis Methods for Robustness to Rotation Invariance

Self-supervised rigid transformation equivariance for accurate 3D point cloud registration

Learning Implicit Functions for Dense 3D Shape Correspondence of Generic Objects

RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation

Joint stereo 3D object detection and implicit surface reconstruction