Decoupled Multimodal Distilling for Emotion Recognition

Yong Li, Yuanzhi Wang, Zhen Cui
2023-03-24
Abstract: Human multimodal emotion recognition (MER) aims to perceive human emotions via language, visual and acoustic modalities. Despite the impressive performance of previous MER approaches, the inherent multimodal heterogeneities persist and the contribution of different modalities varies significantly. In this work, we mitigate this issue by proposing a decoupled multimodal distillation (DMD) approach that facilitates flexible and adaptive crossmodal knowledge distillation, aiming to enhance the discriminative features of each modality. Specifically, the representation of each modality is decoupled into two parts, i.e., modality-irrelevant/-exclusive spaces, in a self-regression manner. DMD utilizes a graph distillation unit (GD-Unit) for each decoupled part so that each GD can be performed in a more specialized and effective manner. A GD-Unit consists of a dynamic graph where each vertex represents a modality and each edge indicates a dynamic knowledge distillation. Such a GD paradigm provides a flexible knowledge transfer manner where the distillation weights can be automatically learned, thus enabling diverse crossmodal knowledge transfer patterns. Experimental results show DMD consistently obtains superior performance to state-of-the-art MER methods. Visualization results show the graph edges in DMD exhibit meaningful distributional patterns w.r.t. the modality-irrelevant/-exclusive feature spaces. Code is released at https://github.com/mdswyz/DMD.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is that, in multimodal emotion recognition (MER), heterogeneity across modalities causes performance gaps between modalities and makes knowledge transfer difficult. Specifically, the language, visual, and acoustic modalities each convey emotional information differently, and this heterogeneity makes effective fusion and representation learning harder. Although existing MER methods have made significant progress, the modalities contribute very unequally to emotion recognition, and direct cross-modal knowledge transfer works poorly. To alleviate this problem, the authors propose a decoupled multimodal distillation (DMD) method. The main contributions of DMD are as follows:

1. **Feature Decoupling**: DMD decouples the features of each modality into two parts: modality-irrelevant and modality-exclusive. The decoupled features are learned through a self-regression mechanism, in a self-supervised manner. In addition, a margin loss and a soft orthogonality loss are introduced to strengthen the decoupling and reduce information redundancy.
2. **Graph Distillation Unit (GD-Unit)**: DMD applies graph distillation units in the decoupled feature spaces for flexible knowledge transfer. It contains two such units: homogeneous graph knowledge distillation (HomoGD) and heterogeneous graph knowledge distillation (HeteroGD). HomoGD performs mutual distillation among the homogeneous features to compensate for the representational capability of each modality. HeteroGD explicitly builds associations and semantic alignment between modalities through a multimodal transformer, thereby achieving effective cross-modal knowledge transfer.
3. **Experimental Verification**: The experimental results show that DMD outperforms existing methods on public MER datasets (such as CMU-MOSI and CMU-MOSEI). Visualization results show that the graph edges exhibit meaningful distribution patterns in the decoupled feature spaces, supporting the effectiveness of the method.

Through these components, DMD addresses the challenges that modality heterogeneity poses for multimodal emotion recognition and improves the accuracy and robustness of emotion recognition.
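The graph distillation idea summarized above (each vertex is a modality, each directed edge carries a distillation weight) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of a softmax over incoming edges, and the L1 distance between logits as the per-edge distillation strength are all assumptions made for illustration. In DMD the edge weights are learned automatically; here the raw edge scores are simply passed in as numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l1_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def graph_distillation_loss(logits, edge_scores):
    """Weighted distillation loss over a fully connected directed graph.

    logits:      dict mapping modality name -> list of floats (hypothetical
                 per-modality prediction logits).
    edge_scores: dict mapping (source, target) -> raw score for that edge.
                 In a trained model these would come from a learned module;
                 here they are plain inputs (assumption).
    Returns the scalar loss and the normalized edge weights.
    """
    modalities = list(logits)
    weights = {}
    loss = 0.0
    for dst in modalities:
        sources = [m for m in modalities if m != dst]
        # Normalize the incoming edges of each target so they sum to 1.
        normalized = softmax([edge_scores[(src, dst)] for src in sources])
        for src, w in zip(sources, normalized):
            weights[(src, dst)] = w
            # Edge weight times the discrepancy between source and target.
            loss += w * l1_distance(logits[src], logits[dst])
    return loss, weights


# Toy example with three modalities: language (L), visual (V), acoustic (A).
logits = {"L": [2.0, 0.0], "V": [1.0, 0.0], "A": [0.0, 0.0]}
edge_scores = {(s, d): 0.0 for s in logits for d in logits if s != d}
loss, weights = graph_distillation_loss(logits, edge_scores)
```

With all raw scores equal, every target splits its incoming weight evenly between the other two modalities. Raising the score of one edge, for example `("L", "A")`, shifts more of the acoustic vertex's incoming weight to the language vertex, which is the kind of adaptive transfer pattern the dynamic graph is meant to allow.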