Multi-Trusted Cross-Modal Information Bottleneck for 3D Self-Supervised Representation Learning
Haozhe Cheng, Xu Han, Pengcheng Shi, Jihua Zhu, Zhongyu Li
DOI: https://doi.org/10.1016/j.knosys.2023.111217
IF: 8.139
2023-01-01
Knowledge-Based Systems
Abstract: Mainstream 2D-3D multi-modal contrastive learning methods perform similarity clustering on features extracted from different data modalities, such as color and spatial coordinates, to capture modality representations. However, because of the intrinsic differences between modalities and the noise in the data, not all information benefits the contrastive task, and some of it may cause overfitting. Furthermore, improper fusion of multi-modal representations undermines information integrity. To address these challenges, this paper proposes a new 3D self-supervised contrastive learning method called Multi-Trusted Cross-Modal Information Bottleneck (MCIB), which filters out irrelevant information and fuses multi-modal features guided by belief and uncertainty. On the one hand, the Multi-Modal Information Bottleneck (MMIB) suppresses useless information that disturbs the contrast by defining a lower bound on information propagation, which improves representation robustness and alleviates overfitting. On the other hand, Multi-Trusted Contrastive Learning (MTCL) treats the filtered descriptors as trusted evidence and estimates the uncertainty of each modality through a Dirichlet distribution transformation. Dempster-Shafer theory then integrates the probability distributions of the multi-modal representations according to belief and uncertainty, yielding trusted contrastive clustering. Empirical experiments, ablation studies, confirmatory experiments, and robustness tests on public datasets with different backbones confirm the performance and robustness of MCIB and its sub-methods, MMIB and MTCL, in object classification, few-shot classification, and part segmentation.
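The abstract does not give the formulas behind the belief-and-uncertainty fusion, so the following is only a minimal sketch of how evidence-based fusion of two modalities is commonly formulated in subjective logic, not the authors' implementation. It assumes that each modality yields a vector of non-negative per-class evidence, that Dirichlet parameters are obtained as evidence plus one, and that two modalities are merged with the reduced Dempster combination rule. The function names `evidence_to_opinion` and `combine`, and the evidence values in the usage lines, are hypothetical.

```python
from typing import List, Tuple

# An opinion is (belief mass per class, uncertainty mass); the two sum to 1.
Opinion = Tuple[List[float], float]


def evidence_to_opinion(evidence: List[float]) -> Opinion:
    """Map non-negative per-class evidence to belief masses and uncertainty.

    Assumption: Dirichlet parameters are alpha_k = e_k + 1, and the
    Dirichlet strength is S = sum(alpha). Then b_k = e_k / S and u = K / S.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def combine(first: Opinion, second: Opinion) -> Opinion:
    """Fuse two opinions with the reduced Dempster combination rule."""
    b1, u1 = first
    b2, u2 = second
    if len(b1) != len(b2):
        raise ValueError("opinions must cover the same classes")
    # Conflict: belief mass the two modalities assign to different classes.
    agreement = sum(x * y for x, y in zip(b1, b2))
    conflict = sum(b1) * sum(b2) - agreement
    scale = 1.0 - conflict
    if scale <= 0.0:
        raise ValueError("opinions are in total conflict")
    belief = [(x * y + x * u2 + y * u1) / scale for x, y in zip(b1, b2)]
    uncertainty = (u1 * u2) / scale
    return belief, uncertainty


# Hypothetical evidence for three classes from an image and a point-cloud branch.
image_opinion = evidence_to_opinion([4.0, 1.0, 0.0])
point_opinion = evidence_to_opinion([3.0, 0.0, 1.0])
fused_belief, fused_uncertainty = combine(image_opinion, point_opinion)
```

Under these assumptions, a modality with little evidence keeps a large uncertainty mass and therefore contributes less to the fused belief, which is one way to read the "trusted" fusion described in the abstract.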