Abstract: Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain, highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To this end, we propose an innovative framework for multi-view representation learning, which incorporates a technique we term 'distilled disentangling'. Our method introduces the concept of masked cross-view prediction, enabling the extraction of compact, high-quality view-consistent representations from various sources without incurring extra computational overhead. Additionally, we develop a distilled disentangling module that efficiently filters out consistency-related information from multi-view representations, resulting in purer view-specific representations. This approach significantly reduces redundancy between view-consistent and view-specific representations, enhancing the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover, we find that reducing the dimensionality of view-consistent representations relative to that of view-specific representations further refines the quality of the combined representations. Our code is accessible at:

What problem does this paper attempt to address?

This paper addresses redundancy in multi-view representation learning, that is, the information shared between the view-consistent and view-specific representations a model extracts. Specifically, the paper argues that current methods fail to separate view-consistent information from view-specific information when processing multi-view data, so the learned representations overlap heavily. This redundancy degrades representation quality and increases the computational burden of downstream tasks.

To meet this challenge, the paper proposes a framework called "Distilled Disentangling". Using the "Masked Cross-View Prediction" (MCP) technique, it extracts compact, high-quality view-consistent representations from multiple data sources. A distilled disentangling module then filters consistency-related information out of the multi-view representations, yielding purer view-specific representations. The method significantly reduces the redundancy between view-consistent and view-specific representations and improves the overall efficiency of the learning process.

### Main Contributions

1. **Reveal the Fundamental Challenges in Multi-view Representation Learning**: From a disentanglement perspective, the paper exposes the limitations of existing models, especially the redundancy between view-consistent and view-specific representations.
2. **Propose a Multi-view Representation Learning Framework Based on Distilled Disentangling**: The framework provides a new way to construct low-redundancy view-consistent and view-specific representations.
3. **Experimental Verification**: Extensive experiments on multiple benchmark datasets show that the method outperforms prior work. In particular, a high masking ratio significantly improves the quality of view-consistent representations, and reducing the dimension of the view-consistent representation relative to that of the view-specific representations further improves the quality of the combined representation.

### Method Overview

1. **Overall Architecture**:
   - **First Stage**: Apply random masking to the multi-view data, extract view-consistent representations with a consistent encoder, and then use multiple view-specific decoders to reconstruct the views. This stage is called "Masked Cross-View Prediction".
   - **Second Stage**: Use a set of view-specific encoders to extract view-specific representations, and minimize the mutual information between view-consistent and view-specific representations through a disentangling module to obtain high-quality view-specific representations.
2. **Masked Cross-View Prediction Consistency**:
   - Random masking reduces the influence of view-specific information, forcing the consistent encoder to learn view-consistent information from the visible parts.
   - MCP is built on a Variational Auto-Encoder (VAE): the prior over view-consistent representations is assumed to be a standard Gaussian, \( p(c) = \mathcal{N}(0, I) \), and the objective is optimized with the re-parameterization trick.
3. **Distilled Disentangling Specificity**:
   - Use the trained consistent encoder to generate view-consistent representations, and extract coarse-grained view-specific representations with a set of view-specific encoders.
   - Minimize an upper bound on the mutual information between view-specific and view-consistent representations with a disentangling module, implemented using the CLUB estimator.
   - Finally, model high-order interactions through the view-specific decoders to ensure reconstruction quality, which indirectly verifies the quality of the disentangled representations.

### Experimental Results

The paper reports experiments on five multi-view datasets: E-MNIST, E-FMNIST, COIL-20, COIL-100, and Office-31. The results show that the proposed method, MRDD, outperforms existing state-of-the-art methods on both clustering and classification tasks. The quality of view-consistent representations improves most markedly under a high masking ratio.

### Conclusion

With the "Distilled Disentangling" framework, the paper addresses the redundancy problem in multi-view representation learning and improves both representation quality and the efficiency of the learning process. These findings suggest new directions for future research on multi-view representation learning.
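The building blocks named in the method overview (random masking, the re-parameterization trick with a standard Gaussian prior, and a CLUB-style upper bound on mutual information) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the feature-level masking scheme, and the diagonal-Gaussian form of the variational distribution are all assumptions made for illustration.

```python
import numpy as np


def random_mask(x, mask_ratio, rng):
    """Zero out a random fraction of features in each row (hypothetical masking scheme).

    Returns the masked array and a boolean mask that is True where values were kept.
    """
    n, d = x.shape
    num_masked = int(round(mask_ratio * d))
    # Sorting random noise gives an independent random permutation per row.
    order = np.argsort(rng.random((n, d)), axis=1)
    keep = np.ones((n, d), dtype=bool)
    np.put_along_axis(keep, order[:, :num_masked], False, axis=1)
    return x * keep, keep


def reparameterize(mu, logvar, rng):
    """Draw c = mu + sigma * eps with eps ~ N(0, I), so sampling is differentiable in mu, logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch."""
    per_sample = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)
    return float(np.mean(per_sample))


def club_upper_bound(mu, logvar, y):
    """Sample-based CLUB estimate of an upper bound on I(x; y).

    mu[i], logvar[i] parameterize an assumed diagonal Gaussian q(y | x_i);
    y[i] is the sample paired with x_i. Terms that depend only on x_i
    (log-variance and constants) cancel between the two expectations.
    """
    var = np.exp(logvar)
    # Paired term: log q(y_i | x_i).
    positive = -0.5 * np.sum((y - mu) ** 2 / var, axis=1)
    # Unpaired term: average over j of log q(y_j | x_i).
    diff = y[None, :, :] - mu[:, None, :]  # shape (n, n, d)
    negative = -0.5 * np.sum(np.mean(diff ** 2, axis=1) / var, axis=1)
    return float(np.mean(positive - negative))


rng = np.random.default_rng(0)
views = rng.standard_normal((8, 20))           # toy stand-in for one view
masked, keep = random_mask(views, 0.8, rng)    # high mask ratio, as the paper favors
mu, logvar = np.zeros((8, 4)), np.zeros((8, 4))  # stand-ins for encoder outputs
c = reparameterize(mu, logvar, rng)            # sampled view-consistent code
```

In a training loop, the KL term would be added to the reconstruction loss of the first stage, and the CLUB estimate would be minimized in the second stage to push view-specific codes away from the view-consistent code; both pairings are inferred from the summary rather than taken from the released code.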

Rethinking Multi-view Representation Learning via Distilled Disentangling