MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding

Woojeong Jin, Maziar Sanjabi, Shaoliang Nie, Liang Tan, Xiang Ren, Hamed Firooz
DOI: https://doi.org/10.48550/arXiv.2101.01881
2021-10-22
Abstract: To reduce a model size but retain performance, we often rely on knowledge distillation (KD) which transfers knowledge from a large "teacher" model to a smaller "student" model. However, KD on multimodal datasets such as vision-language tasks is relatively unexplored, and digesting multimodal information is challenging since different modalities present different types of information. In this paper, we perform a large-scale empirical study to investigate the importance and effects of each modality in knowledge distillation. Furthermore, we introduce a multimodal knowledge distillation framework, modality-specific distillation (MSD), to transfer knowledge from a teacher on multimodal tasks by learning the teacher's behavior within each modality. The idea aims at mimicking a teacher's modality-specific predictions by introducing auxiliary loss terms for each modality. Furthermore, because each modality has different saliency for predictions, we define saliency scores for each modality and investigate saliency-based weighting schemes for the auxiliary losses. We further study a weight learning approach to learn the optimal weights on these loss terms. In our empirical analysis, we examine the saliency of each modality in KD, demonstrate the effectiveness of the weighting scheme in MSD, and show that it achieves better performance than KD on four multimodal datasets.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is how to transfer knowledge effectively from a large teacher model to a small student model when applying knowledge distillation (KD) to multimodal datasets, especially vision-language tasks. Specifically, the paper focuses on the following points:

1. **Challenges of Knowledge Distillation for Multimodal Data**: Conventional KD methods fall short on multimodal data because different modalities carry different types of information, leaving a significant gap between the student's and the teacher's performance when only a single modality is given as input (as shown in Figure 1 of the paper).
2. **Modality-Specific Distillation (MSD)**: To narrow this gap, the paper introduces a new framework, Modality-Specific Distillation (MSD). MSD adds an auxiliary loss term for each modality, which encourages the student to imitate the teacher's behavior within each modality.
3. **Modality Saliency Analysis**: The paper quantifies how important each modality is for a prediction by defining a modality saliency score, and it explores saliency-based weighting schemes for the auxiliary losses.
4. **Weight Learning**: In addition to fixed-weight schemes, the paper proposes a meta-learning method that automatically learns the optimal weight for each sample, further improving the student's performance.

### Main Contributions

1. **Large-scale Empirical Study**: The paper conducts a large-scale empirical study of the importance and effect of each modality in knowledge distillation.
2. **MSD Framework**: It proposes the Modality-Specific Distillation (MSD) framework, which improves the student's performance on multimodal tasks by introducing modality-specific auxiliary loss terms.
3. **Saliency-Weighting Scheme**: It defines the modality saliency score and proposes several saliency-based weighting schemes for the auxiliary losses.
4. **Weight-Learning Method**: It proposes a meta-learning method that automatically learns the optimal weight for each sample.

### Experimental Results

The paper reports experiments on four multimodal datasets (Hateful-Memes, MM-IMDB, SNLI-VE, VQA2). The results show that:

- **MSD Outperforms Traditional KD**: MSD outperforms traditional knowledge distillation on all four datasets.
- **Saliency Weighting is Effective**: The saliency-based weighting schemes perform better on some datasets, especially Hateful-Memes.
- **Weight Learning is the Best**: The weight-learning method performs best in most cases and imitates the teacher's behavior more closely.
- **Effectiveness across KD Methods**: MSD is not limited to traditional KD; it can also be combined with other distillation methods to further improve performance.

### Conclusion

The paper addresses the challenges of knowledge distillation on multimodal datasets by introducing the Modality-Specific Distillation (MSD) framework, which improves the performance of the student model. Through saliency analysis and weight learning, MSD transfers knowledge from the teacher more effectively, so the student performs better on multimodal tasks.
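
The weighted auxiliary losses summarized above can be illustrated with a small sketch. It is not the authors' implementation. It assumes two modalities (image and text), that teacher and student each produce logits for the full input and for each single-modality input, and that every term is a KL divergence between teacher and student predictions. The view names (`multimodal`, `image_only`, `text_only`), the functions `msd_loss` and `saliency_weights`, and in particular the saliency definition (how far the teacher's prediction shifts when a modality is dropped) are hypothetical choices made for illustration; the paper's exact definitions may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def msd_loss(teacher_logits, student_logits, weights, temperature=1.0):
    """Weighted sum of teacher-student KL terms, one term per input view.

    teacher_logits and student_logits are dicts keyed by view name;
    weights holds one non-negative weight per view to include.
    """
    total = 0.0
    for view, weight in weights.items():
        p_teacher = softmax(teacher_logits[view], temperature)
        p_student = softmax(student_logits[view], temperature)
        total += weight * kl_divergence(p_teacher, p_student)
    return total

def saliency_weights(teacher_logits):
    """Assumed saliency rule (not taken from the paper): a modality is
    salient if dropping it moves the teacher far from its multimodal
    prediction. Scores are normalized to sum to 1."""
    full = softmax(teacher_logits["multimodal"])
    # Dropping the image leaves the text-only view, and vice versa.
    shift_without_image = kl_divergence(full, softmax(teacher_logits["text_only"]))
    shift_without_text = kl_divergence(full, softmax(teacher_logits["image_only"]))
    total = shift_without_image + shift_without_text
    if total <= 0.0:
        return {"image_only": 0.5, "text_only": 0.5}
    return {
        "image_only": shift_without_image / total,
        "text_only": shift_without_text / total,
    }

# Toy logits for one sample with two classes (made-up numbers).
teacher = {"multimodal": [2.0, -1.0], "image_only": [0.5, 0.2], "text_only": [1.8, -0.9]}
student = {"multimodal": [1.5, -0.5], "image_only": [0.1, 0.3], "text_only": [1.0, 0.0]}

# Standard KD term on the full input plus saliency-weighted auxiliary terms.
weights = {"multimodal": 1.0, **saliency_weights(teacher)}
loss = msd_loss(teacher, student, weights)
print(weights, loss)
```

In this toy sample the teacher's text-only prediction is close to its multimodal prediction, so under the assumed rule the text modality receives most of the auxiliary weight. In practice these terms would be added to the usual supervised loss and computed on batches with an autodiff library. The weight-learning variant described above would replace `saliency_weights` with weights produced by a learned model.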