Abstract: In this work, we extend the self-supervised learning (SSL) paradigm by combining its three major frameworks, contrastive learning, self-distillation (knowledge distillation), and masked data modelling, to learn a joint and coordinated representation. The proposed technique learns through the collaboration of these different SSL objectives. To learn them jointly, we propose a new SSL architecture, KDC-MAE, a complementary masking strategy for learning the modular correspondence, and a weighted scheme for combining the objectives in a coordinated way. Experimental results indicate that the contrastive masking correspondence, together with the KD learning objective, improves learning across multiple modalities and multiple tasks.

What problem does this paper attempt to address?

This paper asks how to learn joint and coordinated multi-modal representations by combining the three main self-supervised learning (SSL) frameworks: contrastive learning, self-distillation (a form of knowledge distillation), and masked data modeling. Specifically, the authors propose a new SSL architecture, KDC-MAE (Knowledge Distilled Contrastive Mask Auto-Encoder), which learns the different SSL objectives cooperatively through a complementary masking strategy and a weighted combination of losses, with the aim of improving performance on multiple tasks.

### Summary of the core issues in the paper:

1. **Limitations of existing SSL methods**:
   - When contrastive learning, masked modeling, or self-distillation is used alone, it performs inconsistently across scenarios, and the advantages of the other objectives go unused.
2. **Requirement for joint learning**:
   - A method is needed that combines the three SSL frameworks and finds the mutual correspondence between them, so as to achieve more powerful multi-modal representation learning.
3. **The proposed new method**:
   - KDC-MAE introduces a complementary masking strategy and self-distillation so that the model can find the correspondence between modalities in the encoding space; the self-distillation objective is optimized with a KL divergence loss.
   - Experiments show that this method performs better on multi-modal tasks.

### Formula explanation:

- **Contrastive loss \( L_c \)**:
  \[ L_c = -\frac{1}{N} \sum_{i = 1}^{N} \log \left( \frac{\exp(s_{i,i}/\tau)}{\sum_{k \neq i} \exp(s_{i,k}/\tau) + \exp(s_{i,i}/\tau)} \right) \]
  where \( s_{i,j} = \|\mathbf{c}_v^i\|^T \|\mathbf{c}_a^j\| \), and \( \tau \) is the temperature parameter.
- **Reconstruction loss \( L_r \)**:
  \[ L_r = \frac{1}{N} \sum_{i = 1}^{N} \left[ \frac{\sum (\hat{a}_\mu^i - \text{norm}(a_\mu^i))^2}{|a_\mu^i|} + \frac{\sum (\hat{v}_\mu^i - \text{norm}(v_\mu^i))^2}{|v_\mu^i|} \right] \]
  where \( N \) is the mini-batch size, \( a_\mu \) and \( v_\mu \) are the original masked audio and video blocks, and \( \hat{a}_\mu \) and \( \hat{v}_\mu \) are the corresponding predicted blocks.
- **Self-distillation loss \( L_{kd} \)**:
  \[ L_{kd}(p_1, p_2) = \frac{D(p_1 \| p_2) + D(p_2 \| p_1)}{2} \]
  where \( D(p_1 \| p_2) \) is the KL divergence between two probability distributions \( p_1 \) and \( p_2 \).

### Conclusion:

By combining contrastive learning, masked modeling, and self-distillation, KDC-MAE can achieve better performance on multi-modal tasks, especially in the joint representation learning of audio and video. The experimental results show that this joint learning method can significantly improve the performance of the model.
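The three losses in the formula explanation above can be made concrete with a minimal pure-Python sketch. It is not the authors' implementation and rests on several assumptions: \( \|\cdot\| \) in \( s_{i,j} \) is read as L2 normalization (so \( s_{i,j} \) is a cosine similarity), `norm(.)` is read as zero-mean, unit-variance standardization over each sample's flattened masked values, and the weight names `w_r`, `w_c`, `w_kd` and the default `tau=0.05` are hypothetical, since the summary only says that the objectives are combined in a weighted way.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (assumed meaning of ||.|| in s_{i,j})."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(video_emb, audio_emb, tau=0.05):
    """L_c: softmax cross-entropy over similarities s_{i,k} of video i and audio k."""
    v = [l2_normalize(x) for x in video_emb]
    a = [l2_normalize(x) for x in audio_emb]
    n = len(v)
    total = 0.0
    for i in range(n):
        logits = [sum(p * q for p, q in zip(v[i], a[k])) / tau for k in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += logits[i] - log_denom  # log-probability of the matched pair
    return -total / n


def standardize(values, eps=1e-6):
    """Assumed meaning of norm(.): zero mean, unit variance over the values."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return [(x - mean) / math.sqrt(var + eps) for x in values]


def masked_mse(pred, target):
    """Squared error against the normalized target, divided by |target|."""
    t = standardize(target)
    return sum((p - q) ** 2 for p, q in zip(pred, t)) / len(t)


def reconstruction_loss(audio_pred, audio_true, video_pred, video_true):
    """L_r: batch mean of audio plus video error on the masked values."""
    n = len(audio_true)
    per_sample = (
        masked_mse(audio_pred[i], audio_true[i])
        + masked_mse(video_pred[i], video_true[i])
        for i in range(n)
    )
    return sum(per_sample) / n


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def self_distillation_loss(p1, p2):
    """L_kd: symmetric KL divergence, the average of both directions."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))


def total_loss(l_r, l_c, l_kd, w_r, w_c, w_kd):
    """Weighted sum of the three objectives (weight names are hypothetical)."""
    return w_r * l_r + w_c * l_c + w_kd * l_kd
```

Averaging the two KL directions makes \( L_{kd} \) symmetric in its arguments, so neither distribution is treated as a fixed teacher. In practice these losses would be computed on batched tensors in a deep learning framework; plain lists are used here only to keep the sketch self-contained.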

KDC-MAE: Knowledge Distilled Contrastive Mask Auto-Encoder
