Learning Robust Multi-Modal Representation for Multi-Label Emotion Recognition Via Adversarial Masking and Perturbation

Shiping Ge,Zhiwei Jiang,Zifeng Cheng,Cong Wang,Yafeng Yin,Qing Gu
DOI: https://doi.org/10.1145/3543507.3583258
2023-01-01
Abstract:Recognizing emotions from multi-modal data is an emotion recognition task that requires strong multi-modal representation ability. The general approach to this task is to naturally train the representation model on training data without intervention. However, such natural training scheme is prone to modality bias of representation (i.e., tending to over-encode some informative modalities while neglecting other modalities) and data bias of training (i.e., tending to overfit training data). These biases may lead to instability (e.g., performing poorly when the neglected modality is dominant for recognition) and weak generalization (e.g., performing poorly when unseen data is inconsistent with overfitted data) of the model on unseen data. To address these problems, this paper presents two adversarial training strategies to learn more robust multi-modal representation for multi-label emotion recognition. Firstly, we propose an adversarial temporal masking strategy, which can enhance the encoding of other modalities by masking the most emotion-related temporal units (e.g., words for text or frames for video) of the informative modality. Secondly, we propose an adversarial parameter perturbation strategy, which can enhance the generalization of the model by adding the adversarial perturbation to the parameters of model. Both strategies boost model performance on the benchmark MMER datasets CMU-MOSEI and NEMu. Experimental results demonstrate the effectiveness of the proposed method compared with the previous state-of-the-art method. Code will be released at https://github.com/ShipingGe/MMER.
What problem does this paper attempt to address?