Abstract: Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is intriguing because it does not fit neatly into any of the commonplace MVSSL lineages, instead originating from a statistical mechanical perspective on the linear separability of data manifolds. In this paper, we seek to improve our understanding and our utilization of MMCR. To better understand MMCR, we leverage tools from high-dimensional probability to demonstrate that MMCR incentivizes alignment and uniformity of learned embeddings. We then leverage tools from information theory to show that such embeddings maximize a well-known lower bound on mutual information between views, thereby connecting the geometric perspective of MMCR to the information-theoretic perspective commonly discussed in MVSSL. To better utilize MMCR, we mathematically predict and experimentally confirm non-monotonic changes in the pretraining loss akin to double descent but with respect to atypical hyperparameters. We also discover compute scaling laws that enable predicting the pretraining loss as a function of gradient steps, batch size, embedding dimension and number of views. We then show that MMCR, originally applied to image data, is performant on multimodal image-text data. By more deeply understanding the theoretical and empirical behavior of MMCR, our work reveals insights on improving MVSSL methods.

What problem does this paper attempt to address?

This paper aims to understand and better utilize a multi-view self-supervised learning (MVSSL) method named Maximum Manifold Capacity Representations (MMCR). MMCR is a recently proposed MVSSL method whose performance matches or surpasses that of existing leading methods. It is unusual in that it does not fit neatly into any of the common MVSSL lineages; instead, it originates from a statistical mechanical perspective on the linear separability of data manifolds. The paper focuses on three aspects:

1. **Understanding the working mechanism of MMCR**:
   - Use tools from high-dimensional probability to show that MMCR encourages alignment and uniformity of the learned embeddings.
   - Use tools from information theory to show that such embeddings maximize a well-known lower bound on the mutual information between views, connecting the geometric perspective of MMCR with the information-theoretic perspective common in MVSSL.
2. **Optimizing the application of MMCR**:
   - Mathematically predict and experimentally verify non-monotonic changes in the pretraining loss, similar to double descent but with respect to atypical hyperparameters (such as the number of data points \( P \) and the embedding dimension \( D \)).
   - Discover compute scaling laws that predict the pretraining loss as a function of the number of gradient steps, batch size, embedding dimension, and number of views.
3. **Expanding the application range of MMCR**:
   - Apply MMCR to multimodal image-text data. The study found that MMCR outperforms CLIP at smaller batch sizes and performs worse than CLIP at larger batch sizes, indicating that MMCR may need the batch size and the embedding dimension to be increased together to achieve better performance.
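The non-monotonic behavior in \( P \) and \( D \) can be illustrated with a toy experiment. The sketch below is not the paper's experiment: it assumes, purely for illustration, that the rows of the \( P \times D \) matrix are random unit vectors rather than embeddings from a trained network, and it measures the relative gap between the nuclear norm and its upper bound \( \sqrt{P \cdot \min(P, D)} \). The helper names `percent_error` and `random_unit_rows` are hypothetical.

```python
import numpy as np


def percent_error(C):
    """Relative gap between the nuclear norm of C and the bound sqrt(P * min(P, D))."""
    P, D = C.shape
    bound = np.sqrt(P * min(P, D))
    return (bound - np.linalg.norm(C, ord="nuc")) / bound


def random_unit_rows(P, D, rng):
    """Assumed stand-in for learned centroids: P rows drawn uniformly from the unit sphere in R^D."""
    X = rng.standard_normal((P, D))
    return X / np.linalg.norm(X, axis=1, keepdims=True)


rng = np.random.default_rng(0)
D = 64  # embedding dimension, held fixed while P is swept
errors = {P: percent_error(random_unit_rows(P, D, rng)) for P in (16, 32, 64, 128, 256)}

for P, err in errors.items():
    print(f"P={P:4d}  D={D}  percent error={err:.3f}")
```

In this toy setting the gap is largest near \( P = D \) and shrinks as \( P \) moves away from \( D \) in either direction, which mirrors the double-descent-like shape described above.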
Through this work, the paper deepens the understanding of MMCR and provides new insights and directions for improving MVSSL methods.

### Formula summary

1. **MMCR pretraining loss**:
   \[ L_{\text{MMCR}} = - \|C\|_* = - \sum_{r = 1}^{\min(P, D)} \sigma_r(C) \]
   where \( C \) is a \( P \times D \) matrix whose \( p \)-th row is the embedding center \( c_p \) of the \( p \)-th data point, \( \sigma_r(C) \) is the \( r \)-th singular value of \( C \), and \( \|C\|_* \) is the nuclear norm (also called the trace norm or Schatten 1-norm) of \( C \).
2. **Pretraining percent error**:
   \[ \text{Pretraining Percent Error}(C) = \frac{\sqrt{P \cdot \min(P, D)} - \|C\|_*}{\sqrt{P \cdot \min(P, D)}} \]
3. **Mathematical description of the double-descent phenomenon**: The pretraining percent error defined in item 2 peaks at \( P = D \) and decreases as \( P \) moves away from \( D \) in either direction.
4. **Lower bound on mutual information**:
   \[ I[Z^{(1)}; Z^{(2)}] \geq \mathbb{E}_{p(Z^{(1)}, Z^{(2)})}[\log q(Z^{(1)} | Z^{(2)})] + H[Z^{(1)}] \]
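The loss and percent error above can be computed directly from a batch of view embeddings. The following is a minimal sketch, not the authors' implementation. It assumes the input is an array of shape `(P, K, D)` holding `K` view embeddings per data point, that each embedding is projected onto the unit sphere before averaging, and that the embedding center \( c_p \) is the mean over views. The function names are hypothetical.

```python
import numpy as np


def mmcr_loss(views):
    """Return (loss, C) for view embeddings of shape (P, K, D).

    Assumption: each embedding is normalized to unit length, and the
    embedding center c_p is the mean of the K normalized views of point p.
    """
    z = views / np.linalg.norm(views, axis=-1, keepdims=True)
    C = z.mean(axis=1)                          # (P, D) matrix of centers
    sigma = np.linalg.svd(C, compute_uv=False)  # singular values of C
    return -sigma.sum(), C                      # loss = negative nuclear norm


def pretraining_percent_error(C):
    """Relative gap between ||C||_* and the bound sqrt(P * min(P, D))."""
    P, D = C.shape
    bound = np.sqrt(P * min(P, D))
    return (bound - np.linalg.norm(C, ord="nuc")) / bound


P, K, D = 4, 3, 8

# Case 1: all views of a point coincide and the centers are mutually orthogonal.
aligned = np.repeat(np.eye(D)[:P, None, :], K, axis=1)
loss_aligned, C_aligned = mmcr_loss(aligned)

# Case 2: every view of every point collapses to the same vector.
collapsed = np.tile(np.eye(D)[0], (P, K, 1))
loss_collapsed, C_collapsed = mmcr_loss(collapsed)

print("aligned:  ", loss_aligned, pretraining_percent_error(C_aligned))
print("collapsed:", loss_collapsed, pretraining_percent_error(C_collapsed))
```

With \( P \le D \), the aligned and orthogonal case attains the bound (nuclear norm \( P \), percent error 0), while the collapsed case yields a rank-one \( C \) with nuclear norm \( \sqrt{P} \) and therefore a percent error of 0.5 for \( P = 4 \).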

Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations
