Abstract: Driven by growing interest in 3D techniques and the resulting large-scale 3D data, 3D model classification has attracted enormous attention from both the research and industry communities. Most current methods depend heavily on sufficient labeled 3D models, which substantially restricts their scalability to novel classes with few annotated training samples, since scarce data increases the chance of overfitting. Moreover, they leverage only single-modal information (either point cloud or multi-view), and few works integrate this complementary information for 3D model representation. To overcome these problems, we propose a multi-modal meta-transfer fusion network (M³TF), the key of which is to perform few-shot multi-modal representation for 3D model classification. Specifically, we first convert the original 3D data into both multi-view and point cloud modalities, and pre-train individual encoding networks on a large-scale dataset to obtain the optimal initial parameters, which benefits few-shot learning tasks. Then, to enable the network to adapt to few-shot learning tasks, we update the parameters of the Scaling and Shifting operation (SS), the multi-modal representation fusion (MMRF) module and the 3D model classifier to obtain optimal initialization parameters. Since the large number of trainable parameters in the feature extractors increases the chance of overfitting, we freeze the feature extractors and introduce an SS operation to adjust their weights. SS can reduce the number of training parameters up to 20%, which can effectively avoid overfitting. MMRF adaptively integrates the multi-modal information according to its significance to the 3D model, yielding a more robust 3D representation. Since no dataset is available for evaluating this new task, we build three 3D CAD datasets, Meta-ModalNet, Meta-ShapeNet and Meta-RGBD, and implement representative methods for fair comparison. Extensive experimental results demonstrate the superiority of the proposed method.
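
The abstract does not give the exact form of the SS operation or of MMRF, so the following is only a minimal sketch under stated assumptions. It assumes SS multiplies each frozen output row of a layer by a learnable scale and adds a learnable shift to the frozen bias, and it assumes the fusion is a softmax-weighted sum of per-modality feature vectors. The names `ss_linear`, `fuse`, `scale`, `shift` and `scores` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math

def ss_linear(x, weight, bias, scale, shift):
    """Frozen linear layer modulated by a per-output scale and shift.

    Assumed form: y_j = sum_i (weight[j][i] * scale[j]) * x[i] + bias[j] + shift[j]
    Only `scale` and `shift` would be trained; `weight` and `bias` stay frozen.
    """
    return [
        sum(w * scale[j] * xi for w, xi in zip(weight[j], x)) + bias[j] + shift[j]
        for j in range(len(weight))
    ]

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, scores):
    """Assumed fusion: softmax-weighted sum of per-modality feature vectors."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]

# Frozen pre-trained parameters (toy values, 3 inputs -> 2 outputs).
W = [[1.0, 2.0, 0.0], [0.5, -1.0, 2.0]]
b = [0.0, 1.0]
x = [1.0, 1.0, 1.0]

# With scale = 1 and shift = 0, SS leaves the frozen layer unchanged.
identity_out = ss_linear(x, W, b, [1.0, 1.0], [0.0, 0.0])
# After adaptation, only the small scale/shift vectors differ.
adapted_out = ss_linear(x, W, b, [2.0, 1.0], [0.0, 0.5])

# Parameter counts for this toy layer: frozen vs. trainable under SS.
frozen_count = sum(len(row) for row in W) + len(b)
trainable_count = 2 + 2  # one scale and one shift per output

# Equal scores give equal weight to the two modality features.
fused = fuse([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In this toy layer the frozen weights and biases hold 8 values while SS trains only 4, which illustrates why freezing the extractor and learning a scale and shift shrinks the trainable set. The actual ratio depends on the layer shapes used in the paper.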

Multi-modal fusion network guided by prior knowledge for 3D CAD model recognition

Hamming Embedding Sensitivity Guided Fusion Network for 3D Shape Representation

Multi-Modal Meta-Transfer Fusion Network for Few-Shot 3D Model Classification

MANet: Multimodal Attention Network based Point-View fusion for 3D Shape Recognition

Multi-View Adaptive Fusion Network for 3D Object Detection

MM-Net: A MixFormer-Based Multi-Scale Network for Anatomical and Functional Image Fusion

AM3Net: Adaptive Mutual-learning-based Multimodal Data Fusion Network

Cascaded Multi-3D-view Fusion for 3D-Oriented Object Detection

CMDFusion: Bidirectional Fusion Network with Cross-modality Knowledge Distillation for LIDAR Semantic Segmentation

ACF-Net: Asymmetric Cascade Fusion for 3D Detection with LiDAR Point Clouds and Images

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

Multi-modality Fusion Network for Action Recognition

Multimodal MRI Volumetric Data Fusion With Convolutional Neural Networks

mmFUSION: Multimodal Fusion for 3D Objects Detection

Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

MDC-RHT: Multi-Modal Medical Image Fusion via Multi-Dimensional Dynamic Convolution and Residual Hybrid Transformer

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection

Multi-Modal 3D Object Detection by Box Matching

From One to Many: Dynamic Cross Attention Networks for LiDAR and Camera Fusion

AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

MMAF-Net: Multi-view multi-stage adaptive fusion for multi-sensor 3D object detection