Language-guided Multi-modal Emotional Mimicry Intensity Estimation

Feng Qiu, Wei Zhang, Chen Liu, Lincheng Li, Heming Du, Tianchen Guo, Xin Yu
DOI: https://doi.org/10.1109/cvprw63382.2024.00477
2024-01-01
Computer Vision and Pattern Recognition
Abstract: Emotional Mimicry Intensity (EMI) estimation aims to identify the intensity of mimicry exhibited by individuals in response to observed emotions. The challenge in EMI estimation lies in discerning nuanced facial expression cues in mimicry behaviors based on the seed video and the text instructions. In this paper, we propose a multi-modal EMI estimation framework that leverages visual, auditory, and textual modalities to capture a comprehensive emotional profile. We first extract representations for each modality separately and then fuse the modality-specific representations via a Temporal Segment Network, optimizing for temporal coherence and emotional context. Furthermore, we find that participants demonstrate notable proficiency in mimicking text instructions, yet are less effective at replicating facial expressions and vocal tones. In light of this, we design a contrastive learning mechanism to refine the extracted features based on textual guidance. As a result, features derived from similar text instructions are closely aligned, which improves the estimation of emotional mimicry intensity by leveraging the dominant textual modality. Experiments conducted on the Hume-Vidmimic2 dataset illustrate the effectiveness of our framework in EMI estimation. Our framework was recognized as the leading solution in the Emotional Mimicry Intensity (EMI) Estimation Challenge at the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). More information about the competition can be found at: 6th ABAW.
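The text-guided contrastive idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so the code below assumes a generic supervised-contrastive-style objective in which samples that share a text instruction are treated as positives. The function names, the `instruction_ids` grouping, and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def text_guided_contrastive_loss(features, instruction_ids, temperature=0.1):
    """Supervised-contrastive-style loss (an assumed formulation, not the paper's).

    features:        list of fused feature vectors, one per sample
    instruction_ids: list of hashable ids; equal ids mean "same text instruction"
    temperature:     softmax temperature (0.1 is an arbitrary illustrative value)
    """
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    per_anchor = []
    for i in range(n):
        # Temperature-scaled cosine similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if instruction_ids[j] == instruction_ids[i]]
        if not positives:
            continue  # an anchor without a same-instruction partner contributes nothing
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(sum(math.exp(s - max_sim) for s in sims.values()))
        # Average negative log-probability of picking a positive for this anchor.
        per_anchor.append(-sum(sims[j] - log_denom for j in positives) / len(positives))
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


# Toy features: samples 0 and 1 are close together, as are samples 2 and 3.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = text_guided_contrastive_loss(toy_features, ["a", "a", "b", "b"])
loss_misaligned = text_guided_contrastive_loss(toy_features, ["a", "b", "a", "b"])
```

In this toy setup the loss is lower when features that share an instruction already lie close together, which is the behavior the abstract attributes to textual guidance. A real training pipeline would compute such a loss on batched tensors in a deep learning framework rather than on Python lists.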