PointCMC: cross-modal multi-scale correspondences learning for point cloud understanding

Honggu Zhou, Xiaogang Peng, Yikai Luo, Zizhao Wu
DOI: https://doi.org/10.1007/s00530-024-01335-7
IF: 3.9
2024-05-01
Multimedia Systems
Abstract: Existing cross-modal frameworks have achieved impressive performance in point cloud object representation learning, where a 2D image encoder is employed to transfer knowledge to a 3D point cloud encoder. However, the local structures of point clouds and their corresponding images are unaligned, which makes it difficult for the 3D point cloud encoder to learn fine-grained image-point cloud interactions. In this paper, we introduce a novel multi-scale training strategy (PointCMC) to enhance fine-grained cross-modal knowledge transfer in the cross-modal framework. Specifically, we design a Local-to-Local (L2L) module that implicitly learns the correspondence of local features by aligning and fusing extracted local feature sets. Moreover, we introduce the Cross-Modal Local-Global Contrastive (CLGC) loss, which enables the encoder to capture discriminative features by relating local structures to their corresponding cross-modal global shape. Extensive experiments demonstrate that our approach outperforms previous unsupervised learning methods in various downstream tasks such as 3D object classification and semantic segmentation.
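The abstract does not state the exact form of the CLGC loss, so the following is only a minimal sketch of the general idea of contrasting local features in one modality against global features in the other. It assumes a generic InfoNCE-style objective with cosine similarity; the function name `local_global_info_nce`, the `temperature` value, and the use of one pooled local feature per object are illustrative assumptions, not details taken from the paper.

```python
import math


def _normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_global_info_nce(local_feats, global_feats, temperature=0.1):
    """Generic InfoNCE-style local-to-global contrastive loss (assumed form).

    local_feats[i]  : local feature of object i from one modality
    global_feats[i] : global feature of object i from the other modality
    The matching pair (i, i) is the positive; all (i, j != i) are negatives.
    """
    assert len(local_feats) == len(global_feats)
    locals_n = [_normalize(v) for v in local_feats]
    globals_n = [_normalize(v) for v in global_feats]

    total = 0.0
    for i, local in enumerate(locals_n):
        # Cosine similarities of this local feature to every global feature.
        logits = [_dot(local, g) / temperature for g in globals_n]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denom - logits[i]
    return total / len(locals_n)
```

Under this assumed form, the loss is small when each local feature is most similar to the global feature of its own object and grows when it is closer to the global feature of a different object.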
computer science, information systems, theory & methods