Abstract: Multi-view (or multi-modality) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations into consistent and view-specific representations by introducing strong inductive biases, which can limit their generalization ability. In this paper, we propose a novel multi-view representation disentangling method that aims to go beyond inductive biases, ensuring both the interpretability and the generalizability of the resulting representations. Our method is based on the observation that discovering multi-view consistency in advance can determine the disentangling information boundary, leading to a decoupled learning objective. We also find that the consistency can be easily extracted by maximizing the transformation invariance and clustering consistency between views. These observations lead us to propose a two-stage framework. In the first stage, we obtain multi-view consistency by training a consistent encoder to produce semantically consistent representations across views as well as their corresponding pseudo-labels. In the second stage, we disentangle specificity from comprehensive representations by minimizing an upper bound on the mutual information between consistent and comprehensive representations. Finally, we reconstruct the original data by concatenating pseudo-labels and view-specific representations. Our experiments on four multi-view datasets demonstrate that our proposed method outperforms 12 comparison methods in terms of clustering and classification performance. The visualization results also show that the extracted consistency and specificity are compact and interpretable. Our code is available at https://github.com/Guanzhou-Ke/DMRIB.
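The second stage minimizes an upper bound on mutual information, but the abstract does not say which bound is used. As a rough illustration only, the sketch below computes a sampled CLUB-style (contrastive log-ratio upper bound) estimate under an assumed Gaussian conditional q(y | x). The choice of CLUB, the Gaussian form, and the function name `club_upper_bound` are assumptions made for this example and are not taken from the paper or its repository.

```python
import numpy as np


def club_upper_bound(mu, logvar, y):
    """Sampled CLUB-style estimate of an upper bound on I(x; y).

    Hypothetical helper for illustration only.
    mu, logvar: (N, D) mean and log-variance of an assumed Gaussian q(y | x_i),
                e.g. predicted from the consistent representation x_i.
    y:          (N, D) samples paired with the same x_i,
                e.g. the comprehensive representation.
    """
    var = np.exp(logvar)
    # log q(y_i | x_i); terms that depend only on x_i cancel in the difference below
    positive = (-((mu - y) ** 2) / (2.0 * var)).sum(axis=1)
    # log q(y_j | x_i) averaged over all j, via an (N, N, D) pairwise difference
    diff = y[None, :, :] - mu[:, None, :]
    negative = (-(diff ** 2) / (2.0 * var[:, None, :])).sum(axis=2).mean(axis=1)
    return float((positive - negative).mean())
```

In this sketch, a predictor that ignores x (constant `mu`) yields an estimate of zero, while a predictor that tracks y closely yields a positive value. In training, such an estimate would typically be minimized with respect to the encoder being disentangled.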

Learning Disentangled Representation for Multi-View 3D Object Recognition

Learning the Global Descriptor for 3-D Object Recognition Based on Multiple Views Decomposition

Variable-Viewpoint Representations for 3D Object Recognition

Multi-view Moments Embedding Network for 3D Shape Recognition

Learning Relationships For Multi-View 3d Object Recognition

OVPT: Optimal Viewset Pooling Transformer for 3D Object Recognition

Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations

Disentangling Multi-view Representations Beyond Inductive Bias

Multi-View 3d Object Retrieval with Deep Embedding Network

Multi-view dual attention network for 3D object recognition

View-relation Constrained Global Representation Learning for Multi-View-based 3D Object Recognition

Disentangling 3D/4D Facial Affect Recognition with Faster Multi-View Transformer

A Unified Feature Representation and Learning Framework for 3D Shape

ReINView: Re-interpreting Views for Multi-view 3D Object Recognition

Deep Learning Multi-View Representation for Face Recognition

View-based weight network for 3D object recognition

Learning Canonical View Representation for 3D Shape Recognition with Arbitrary Views

Rethinking Multi-view Representation Learning via Distilled Disentangling

Multiview Compressive Coding for 3D Reconstruction

ViewFormer: View Set Attention for Multi-view 3D Shape Understanding