Abstract: Human multimodal emotion recognition (MER) aims to perceive human emotions via language, visual, and acoustic modalities. Despite the impressive performance of previous MER approaches, inherent multimodal heterogeneity remains a problem, and the contribution of different modalities varies significantly. In this work, we mitigate this issue by proposing a decoupled multimodal distillation (DMD) approach that facilitates flexible and adaptive crossmodal knowledge distillation, aiming to enhance the discriminative features of each modality. Specifically, the representation of each modality is decoupled into two parts, i.e., modality-irrelevant and modality-exclusive spaces, in a self-regression manner. DMD utilizes a graph distillation unit (GD-Unit) for each decoupled part so that each graph distillation (GD) can be performed in a more specialized and effective manner. A GD-Unit consists of a dynamic graph where each vertex represents a modality and each edge indicates a dynamic knowledge distillation. This GD paradigm provides a flexible means of knowledge transfer in which the distillation weights are learned automatically, thus enabling diverse crossmodal knowledge transfer patterns. Experimental results show that DMD consistently outperforms state-of-the-art MER methods. Visualization results show that the graph edges in DMD exhibit meaningful distributional patterns w.r.t. the modality-irrelevant and modality-exclusive feature spaces. Code is released at https://github.com/mdswyz/DMD.

What problem does this paper attempt to address?

This paper addresses the problem that, in multimodal emotion recognition (MER), heterogeneity between modalities causes performance gaps and makes knowledge transfer difficult. Specifically, the language, visual, and acoustic modalities each convey emotional information differently, and this heterogeneity makes effective fusion and representation learning harder. Although existing MER methods have made significant progress, the modalities contribute very unequally to emotion recognition, and direct cross-modal knowledge transfer works poorly.

To alleviate this problem, the authors propose a decoupled multimodal distillation (DMD) method. The main contributions of DMD are as follows:

1. **Feature Decoupling**: DMD decouples the features of each modality into two parts: modality-irrelevant and modality-exclusive. It predicts the decoupled features through a self-regression mechanism and regresses them in a self-supervised manner. In addition, a margin loss and a soft orthogonality loss are introduced to further strengthen feature decoupling and reduce information redundancy.
2. **Graph Distillation Unit (GD-Unit)**: DMD applies graph distillation units in the decoupled feature spaces for flexible knowledge transfer. It contains two such units: homogeneous graph knowledge distillation (HomoGD) and heterogeneous graph knowledge distillation (HeteroGD). HomoGD performs mutual distillation among the homogeneous (modality-irrelevant) features to compensate for the representational capability of each modality. HeteroGD explicitly builds associations and semantic alignment between modalities through a multimodal transformer, thereby achieving effective cross-modal knowledge transfer.
3. **Experimental Verification**: The experimental results show that DMD outperforms existing methods on multiple public MER datasets (such as CMU-MOSI and CMU-MOSEI). Visualization results show that DMD exhibits meaningful distribution patterns in the decoupled feature spaces, supporting the effectiveness of the method.

With these components, DMD addresses the challenges that heterogeneity between modalities poses for multimodal emotion recognition and improves the accuracy and robustness of emotion recognition.
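The two mechanisms summarized above, a soft orthogonality penalty between the decoupled parts and a graph whose edge weights scale the distillation between modalities, can be illustrated with a minimal sketch. This is not the authors' implementation (see the released code for that). The function names, the squared-cosine form of the penalty, the softmax normalization over incoming edges, and the L1 discrepancy between logits are all assumptions made for illustration, and the edge scores are fixed numbers here, whereas in a real model they would be learned.

```python
import math


def soft_orthogonality(irrelevant, exclusive):
    """Assumed penalty: squared cosine similarity of two feature vectors.

    It is 0.0 when the modality-irrelevant and modality-exclusive
    vectors are orthogonal and 1.0 when they are collinear.
    """
    dot = sum(a * b for a, b in zip(irrelevant, exclusive))
    norm_a = math.sqrt(sum(a * a for a in irrelevant))
    norm_b = math.sqrt(sum(b * b for b in exclusive))
    return (dot / (norm_a * norm_b)) ** 2


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_distillation(logits, edge_scores):
    """Weighted distillation loss over a fully connected modality graph.

    logits:      dict mapping modality name -> list of prediction logits
    edge_scores: dict mapping (source, target) -> raw edge score
                 (hypothetical stand-in for a learned quantity)
    Returns (edge weights, total loss).
    """
    modalities = list(logits)
    weights = {}
    total_loss = 0.0
    for target in modalities:
        sources = [m for m in modalities if m != target]
        # Assumption: normalize the incoming edges of each target vertex.
        normalized = softmax([edge_scores[(s, target)] for s in sources])
        for source, weight in zip(sources, normalized):
            weights[(source, target)] = weight
            # Assumption: L1 gap between source and target logits.
            gap = sum(abs(a - b)
                      for a, b in zip(logits[source], logits[target]))
            total_loss += weight * gap
    return weights, total_loss


if __name__ == "__main__":
    # Toy logits for language (L), visual (V), and acoustic (A) modalities.
    toy_logits = {"L": [1.0, 0.0], "V": [0.2, 0.1], "A": [0.0, 0.3]}
    toy_scores = {(s, t): 0.0 for s in toy_logits for t in toy_logits if s != t}
    toy_scores[("L", "V")] = 2.0  # a stronger language-to-visual edge
    print(graph_distillation(toy_logits, toy_scores))
    print(soft_orthogonality([1.0, 0.0], [0.5, 0.5]))
```

Raising one edge score, as in the usage lines, shifts more of the target's incoming weight onto that source, which is the sense in which the graph permits different cross-modal transfer patterns.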

Decoupled Multimodal Distilling for Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Incomplete Multimodality-Diffused Emotion Recognition

Fine-grained Disentangled Representation Learning for Multimodal Emotion Recognition

CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation

Multiplex graph aggregation and feature refinement for unsupervised incomplete multimodal emotion recognition

Muti-modal Emotion Recognition Via Hierarchical Knowledge Distillation

Multimodal Emotion Distribution Learning

A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face

Dense Graph Convolutional with Joint Cross-Attention Network for Multimodal Emotion Recognition

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

A Versatile Multimodal Learning Framework For Zero-shot Emotion Recognition

Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

Correlation-Driven Multi-Modality Graph Decomposition for Cross-Subject Emotion Recognition

Label Distribution Adaptation for Multimodal Emotion Recognition with Multi-label Learning

First-order Multi-label Learning with Cross-modal Interactions for Multimodal Emotion Recognition

Dynamic Emotion-Dependent Network with Relational Subgraph Interaction for Multimodal Emotion Recognition

A Dual Attention-based Modality-Collaborative Fusion Network for Emotion Recognition

Multi-modal emotion recognition using tensor decomposition fusion and self-supervised multi-tasking

Speech Emotion Recognition Via Multi-Level Cross-Modal Distillation

A twin disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction for robust multimodal emotion recognition