KDC-MAE: Knowledge Distilled Contrastive Mask Auto-Encoder

Maheswar Bora, Saurabh Atreya, Aritra Mukherjee, Abhijit Das
2024-11-19
Abstract: In this work, we extend the Self-Supervised Learning (SSL) paradigm by combining contrastive learning, self-distillation (knowledge distillation), and masked data modelling, the three major SSL frameworks, to learn a joint and coordinated representation. The proposed technique learns from the collaborative power of these different SSL objectives. To learn them jointly, we propose a new SSL architecture, KDC-MAE; a complementary masking strategy to learn the modular correspondence; and a weighted way to combine the objectives in a coordinated manner. Experimental results indicate that the contrastive masking correspondence, together with the KD learning objective, improves learning for multiple modalities across multiple tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to learn joint and coordinated multi-modal representations by combining the three main self-supervised learning (SSL) frameworks: contrastive learning, self-distillation (a form of knowledge distillation), and masked data modeling. Specifically, the authors propose a new SSL architecture, KDC-MAE (Knowledge Distilled Contrastive Mask Auto-Encoder), which learns the different SSL objectives cooperatively through a complementary masking strategy and a weighted combination of the objectives, with the aim of improving learning across multiple tasks.

### Summary of the core issues in the paper:

1. **Limitations of existing SSL methods**:
   - Used alone, contrastive learning, masked modeling, and self-distillation each perform inconsistently across scenarios and cannot fully exploit their respective advantages.
2. **Requirement for joint learning**:
   - A method is needed that combines the three SSL frameworks and finds the mutual correspondence between them, so as to achieve stronger multi-modal representation learning.
3. **The proposed new method**:
   - KDC-MAE introduces a complementary masking strategy and self-distillation so that the model can find the correspondence between modalities in the encoding space; the self-distillation objective is optimized with a KL divergence loss.
   - In the reported experiments, this method performs better on multi-modal tasks.

### Formula explanation:

- **Contrastive loss \( L_c \)**:
  \[ L_c = -\frac{1}{N} \sum_{i = 1}^{N} \log \left( \frac{\exp(s_{i,i}/\tau)}{\sum_{k \neq i} \exp(s_{i,k}/\tau) + \exp(s_{i,i}/\tau)} \right) \]
  where \( s_{i,j} = \|\mathbf{c}_v^i\|^T \|\mathbf{c}_a^j\| \) is the similarity between the \(i\)-th video embedding and the \(j\)-th audio embedding (the bars presumably denote L2 normalization of the embedding vectors), and \(\tau\) is the temperature parameter.
- **Reconstruction loss \( L_r \)**:
  \[ L_r = \frac{1}{N} \sum_{i = 1}^{N} \left[ \frac{\sum (\hat{a}_\mu^i - \text{norm}(a_\mu^i))^2}{|a_\mu^i|} + \frac{\sum (\hat{v}_\mu^i - \text{norm}(v_\mu^i))^2}{|v_\mu^i|} \right] \]
  where \( N \) is the mini-batch size, \( a_\mu, v_\mu \) are the original masked audio and video blocks, and \( \hat{a}_\mu, \hat{v}_\mu \) are the corresponding predictions.
- **Self-distillation loss \( L_{kd} \)**:
  \[ L_{kd}(p_1, p_2) = \frac{D(p_1 \| p_2) + D(p_2 \| p_1)}{2} \]
  where \( D(p_1 \| p_2) \) is the KL divergence between two probability distributions \( p_1 \) and \( p_2 \).

### Conclusion:

By combining contrastive learning, masked modeling, and self-distillation, KDC-MAE is reported to achieve better performance on multi-modal tasks, especially in joint audio-video representation learning. The experimental results indicate that this joint learning approach improves model performance.
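
The three losses in the formula explanation can be made concrete with a minimal sketch in plain Python. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the bars in \( s_{i,j} \) are read as L2 normalization, `norm` is read as mean/variance standardization of the masked values, and the default temperature `tau=0.05` is an arbitrary placeholder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed meaning of the bars in s_ij)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def contrastive_loss(video_emb, audio_emb, tau=0.05):
    """L_c: for each video i, cross-entropy of its paired audio i vs. all audios."""
    v = [l2_normalize(x) for x in video_emb]
    a = [l2_normalize(x) for x in audio_emb]
    n = len(v)
    total = 0.0
    for i in range(n):
        # s_{i,k} / tau for every audio k in the mini-batch
        logits = [sum(p * q for p, q in zip(v[i], a[k])) / tau for k in range(n)]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += logits[i] - log_denominator
    return -total / n


def standardize(values, eps=1e-6):
    """Assumed form of norm(.): zero mean, unit variance over the masked values."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return [(x - mean) / math.sqrt(var + eps) for x in values]


def masked_mse(pred, target):
    """Squared error against the standardized target, divided by its size."""
    normalized = standardize(target)
    return sum((p - t) ** 2 for p, t in zip(pred, normalized)) / len(normalized)


def reconstruction_loss(audio_pred, audio_target, video_pred, video_target):
    """L_r: audio plus video masked-block error, averaged over the mini-batch."""
    n = len(audio_pred)
    total = 0.0
    for i in range(n):
        total += masked_mse(audio_pred[i], audio_target[i])
        total += masked_mse(video_pred[i], video_target[i])
    return total / n


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(p1, p2):
    """L_kd: symmetric average of the two KL divergences."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
```

The abstract states that the objectives are combined with weights, but this summary does not give the weights, so the sketch stops at the individual terms. In practice these losses would be computed on tensors in a deep learning framework; plain lists are used here only to keep each formula readable.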