Multimodal Emotion Recognition Calibration in Conversations

Geng Tu, Feng Xiong, Bin Liang, Hui Wang, Xi Zeng, Ruifeng Xu
DOI: https://doi.org/10.1145/3664647.3681515
2024-01-01
Abstract: Multimodal Emotion Recognition in Conversations (MERC) aims to identify the emotion conveyed by each utterance in a conversational video. Current efforts focus on modeling speaker-sensitive context dependencies and multimodal fusion. Despite this progress, the reliability of MERC methods remains largely unexplored. Extensive empirical studies reveal that current methods suffer from unreliable predictive confidence. Specifically, in some cases the confidence estimated by these models increases when a modality or specific contextual cues are corrupted; we define such cases as uncertain samples. This behavior contradicts the foundational principle in informatics, namely the elimination of uncertainty. Based on this observation, we propose CMERC, a novel framework that calibrates MERC models without altering the model structure. It integrates three components: curriculum learning, to guide the model in progressively learning more uncertain samples; hybrid supervised contrastive learning, to refine utterance representations by pulling uncertain samples and others apart; and a confidence constraint, to penalize the model on uncertain samples. Experimental results on two datasets demonstrate the effectiveness and generalization capabilities of CMERC across various MERC models, surpassing state-of-the-art methods.
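The notion of an uncertain sample described in the abstract can be illustrated with a minimal sketch. It assumes confidence is the maximum softmax probability and that a sample is flagged when its confidence on a corrupted input exceeds its confidence on the clean input. The function names (flag_uncertain, confidence_penalty) and the hinge-style penalty are hypothetical illustrations, not the loss or API defined in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits):
    """Predictive confidence, assumed here to be the max softmax probability."""
    return max(softmax(logits))


def flag_uncertain(clean_logits, corrupted_logits):
    """Flag samples whose confidence rises after the input is corrupted."""
    return [
        confidence(corrupt) > confidence(clean)
        for clean, corrupt in zip(clean_logits, corrupted_logits)
    ]


def confidence_penalty(clean_logits, corrupted_logits):
    """Hypothetical hinge penalty: positive only when corruption raises confidence."""
    return [
        max(0.0, confidence(corrupt) - confidence(clean))
        for clean, corrupt in zip(clean_logits, corrupted_logits)
    ]


# Toy logits for two utterances over three emotion classes (made-up values).
clean = [[2.0, 0.5, 0.1], [0.2, 0.1, 0.0]]
corrupted = [[1.0, 0.8, 0.3], [3.0, 0.1, 0.0]]

flags = flag_uncertain(clean, corrupted)
penalties = confidence_penalty(clean, corrupted)
print(flags)
print(penalties)
```

In this toy input, the first utterance becomes less confident under corruption and is not flagged, while the second becomes more confident and is flagged. Flags of this kind could, under the same assumptions, be used to order samples for a curriculum or to select which samples receive the penalty.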