Abstract: In this work, we extend current thinking on the self-supervised learning (SSL) paradigm and showcase a way forward by combining contrastive learning, self-distillation (knowledge distillation), and masked data modelling, the three major SSL frameworks, to learn a joint and coordinated representation. The proposed technique learns through the collaborative power of these different SSL objectives. To learn them jointly, we propose a new SSL architecture, KDC-MAE, a complementary masking strategy to learn the modular correspondence, and a weighted way to combine the objectives coordinately. Experimental results indicate that the contrastive masking correspondence, together with the KD learning objective, leads to better learning for multiple modalities over multiple tasks.
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to learn joint and coordinated multi-modal representations by combining the three main self-supervised learning (SSL) frameworks: contrastive learning, self-distillation (a form of knowledge distillation), and masked data modeling. Specifically, the authors propose a new SSL architecture, KDC-MAE (Knowledge Distilled Contrastive Mask Auto-Encoder), which aims to learn the different SSL objectives cooperatively through a complementary masking strategy and a weighted combination of the objectives, thereby improving learning across multiple tasks.
### Summary of the core issues in the paper:
1. **Limitations of existing SSL methods**:
- When contrastive learning, masked modeling, or self-distillation is used alone, each method performs inconsistently across scenarios, and the strengths of the other two go unused.
2. **Requirement for joint learning**:
- A method is needed that combines the three SSL frameworks above and finds the mutual correspondence between them, so as to achieve more powerful multi-modal representation learning.
3. **The proposed new method**:
- KDC-MAE introduces a complementary masking strategy and self-distillation so that the model can find the correspondence between modalities in the encoding space; the self-distillation objective is optimized with a KL divergence loss.
- In the reported experiments, this method shows better performance on multi-modal tasks.
### Formula explanation:
- **Contrastive loss \( L_c \)**:
\[
L_c = -\frac{1}{N} \sum_{i = 1}^{N} \log \left( \frac{\exp(s_{i,i}/\tau)}{\sum_{k \neq i} \exp(s_{i,k}/\tau) + \exp(s_{i,i}/\tau)} \right)
\]
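where \( s_{i,j} = \|\mathbf{c}_v^i\|^T \|\mathbf{c}_a^j\| \) is the similarity between the \( i \)-th video embedding and the \( j \)-th audio embedding, and \(\tau\) is the temperature parameter. Here \( \|\cdot\| \) appears to denote L2 normalization of the vector rather than its scalar norm, which would make \( s_{i,j} \) a cosine similarity.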
- **Reconstruction loss \( L_r \)**:
\[
L_r = \frac{1}{N} \sum_{i = 1}^{N} \left[ \frac{\sum (\hat{a}_\mu^i - \text{norm}(a_\mu^i))^2}{|a_\mu^i|} + \frac{\sum (\hat{v}_\mu^i - \text{norm}(v_\mu^i))^2}{|v_\mu^i|} \right]
\]
where \( N \) is the mini-batch size, \( a_\mu \) and \( v_\mu \) are the original masked audio and video blocks, and \( \hat{a}_\mu \) and \( \hat{v}_\mu \) are the corresponding predictions.
- **Self-distillation loss \( L_{kd} \)**:
\[
L_{kd}(p_1, p_2) = \frac{D(p_1 \| p_2) + D(p_2 \| p_1)}{2}
\]
where \( D(p_1 \| p_2) \) is the KL divergence between two probability distributions \( p_1 \) and \( p_2 \).
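The three losses above can be sketched in a few lines of plain Python. This is only an illustrative reading of the formulas, not the authors' implementation: the function names are hypothetical, the default `tau=0.05` is a placeholder rather than a value taken from the paper, the similarity is assumed to be the cosine similarity of the embeddings, only the video-to-audio direction written in \( L_c \) is computed, and the reconstruction targets are assumed to have had `norm(.)` applied already.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(video_embs, audio_embs, tau=0.05):
    """L_c: matching pairs (i, i) are positives, all other pairs negatives."""
    v = [l2_normalize(e) for e in video_embs]
    a = [l2_normalize(e) for e in audio_embs]
    n = len(v)
    total = 0.0
    for i in range(n):
        # s_{i,k} / tau for every audio embedding k in the batch
        logits = [sum(x * y for x, y in zip(v[i], a[k])) / tau for k in range(n)]
        # The denominator sums over all k (k != i plus k == i);
        # subtracting the max keeps exp() numerically stable.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += logits[i] - log_denominator
    return -total / n


def masked_mse(pred, target):
    """Sum of squared errors over one sample's masked values, divided by their count."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def reconstruction_loss(audio_pred, audio_target, video_pred, video_target):
    """L_r: per-sample audio and video masked MSE, averaged over the batch.

    Targets are assumed to be normalized already, i.e. norm(.) was applied upstream.
    """
    n = len(audio_pred)
    total = 0.0
    for i in range(n):
        total += masked_mse(audio_pred[i], audio_target[i])
        total += masked_mse(video_pred[i], video_target[i])
    return total / n


def kl_divergence(p, q):
    """D(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(p1, p2):
    """L_kd: symmetric average of the two KL divergences."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
```

Because \( L_{kd} \) averages both directions of the KL divergence, swapping its arguments leaves the value unchanged, and it is zero when the two distributions are identical. In practice these losses would be computed on batched tensors in a deep learning framework; the loops here are kept only to mirror the summations in the formulas.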
### Conclusion:
By combining contrastive learning, masked modeling, and self-distillation, KDC-MAE can achieve better performance on multi-modal tasks, especially in the joint representation learning of audio and video. The experimental results show that this joint learning method can significantly improve the performance of the model.