Abstract: As two fundamental representation modalities of 3D objects, 3D point clouds and multi-view 2D images record shape information from the different domains of geometric structure and visual appearance. In the current deep learning era, remarkable progress in processing these two data modalities has been achieved by customizing compatible 3D and 2D network architectures, respectively. However, unlike multi-view image-based 2D visual modeling paradigms, which have shown leading performance on several common 3D shape recognition benchmarks, point cloud-based 3D geometric modeling paradigms are still highly limited by insufficient learning capacity, owing to the difficulty of extracting discriminative features from irregular geometric signals. In this paper, we explore the possibility of boosting deep 3D point cloud encoders by transferring visual knowledge extracted from deep 2D image encoders under a standard teacher-student distillation workflow. Specifically, we propose PointMCD, a unified multi-view cross-modal distillation architecture comprising a pretrained deep image encoder as the teacher and a deep point encoder as the student. To perform heterogeneous feature alignment between the 2D visual and 3D geometric domains, we further investigate visibility-aware feature projection (VAFP), by which point-wise embeddings are aggregated into view-specific geometric descriptors. By aligning multi-view visual and geometric descriptors in a pairwise manner, we obtain more powerful deep point encoders without exhaustive and complicated network modification. Experiments on 3D shape classification, part segmentation, and unsupervised learning strongly validate the effectiveness of our method. The code and data will be publicly available at https://github.com/keeganhk/PointMCD.
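The abstract's core idea, pooling point-wise embeddings into one geometric descriptor per view and aligning it with the teacher's visual descriptor for the same view, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `view_descriptors` and `alignment_loss`, the use of max-pooling over visible points, the mean-squared-error alignment term, and the assumption that teacher and student descriptors share the same channel width are all illustrative choices. How the visibility mask is computed is likewise left unspecified here.

```python
import numpy as np

def view_descriptors(point_feats, visibility):
    """Pool point-wise embeddings into one descriptor per view (assumed max-pooling).

    point_feats: (N, C) float array of per-point embeddings from the student.
    visibility:  (V, N) boolean array; visibility[v, n] is True when point n
                 is treated as visible from view v (hypothetical input).
    Returns a (V, C) array; a view with no visible points gets zeros.
    """
    # Broadcast to (V, N, C); hidden points are set to -inf so they never win the max.
    masked = np.where(visibility[:, :, None], point_feats[None, :, :], -np.inf)
    pooled = masked.max(axis=1)
    # Views with no visible points would be all -inf; replace those with zeros.
    return np.where(np.isfinite(pooled), pooled, 0.0)

def alignment_loss(geo_desc, vis_desc):
    """Mean squared error between paired (V, C) descriptors (assumed loss form)."""
    return float(np.mean((geo_desc - vis_desc) ** 2))

# Toy example: 3 points with 2 channels, observed from 3 views.
point_feats = np.array([[1.0, 2.0], [3.0, 0.0], [-1.0, 5.0]])
visibility = np.array([
    [True, True, False],    # view 0 sees points 0 and 1
    [False, False, True],   # view 1 sees point 2 only
    [False, False, False],  # view 2 sees nothing
])
student_desc = view_descriptors(point_feats, visibility)
teacher_desc = student_desc + 1.0  # stand-in for frozen teacher features
loss = alignment_loss(student_desc, teacher_desc)
```

In a real training loop the teacher descriptors would come from a frozen pretrained image encoder applied to rendered views, and the loss would be backpropagated through the point encoder only; a projection layer would be needed whenever the two feature widths differ.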

3D Shape Classification Based on Global and Local Features Extraction with Collaborative Learning

Discriminatively Learning for Representing Local Image Features with Quadruplet Model

Local Deep Feature Learning Framework for 3D Shape

Unify 3D Shape Retrieval and Classification in One Framework

A Unified Feature Representation and Learning Framework for 3D Shape

LATFormer: Locality-Aware Point-View Fusion Transformer for 3D shape recognition

OVPT: Optimal Viewset Pooling Transformer for 3D Object Recognition

Learning Point Cloud Shapes with Geometric and Topological Structures

Learning the Global Descriptor for 3-D Object Recognition Based on Multiple Views Decomposition

Parts4Feature: Learning 3D Global Features from Generally Semantic Parts in Multiple Views

3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification

3D shape classification based on convolutional neural networks fusing multi-view information

Learning Discriminative and Generative Shape Embeddings for Three-Dimensional Shape Retrieval

A Transformer-Based Capsule Network for 3D Part–Whole Relationship Learning

3D2seqviews: Aggregating Sequential Views for 3D Global Feature Learning by CNN with Hierarchical Attention Aggregation

Learning a discriminative deformation-invariant 3D shape descriptor via many-to-one encoder

LIMAN: Local Information based Multi Attention Network for 3D Shape Recognition

PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition

Learning point cloud context information based on 3D transformer for more accurate and efficient classification

Learning View-Based Graph Convolutional Network for Multi-View 3D Shape Analysis