MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Ziping Zhao, Tian Gao, Haishuai Wang, Björn Schuller
DOI: https://doi.org/10.21437/interspeech.2024-1735
2024-01-01
Abstract: Emotion recognition in conversation should not rely solely on detecting emotion keywords; it should also make a comprehensive judgment after considering the context. To this end, we propose MFDR to efficiently integrate acoustic and textual information. Specifically, acoustic-word combination and context perception are modeled sequentially, in stages, through Sliding Adaptive Window Attention (SAWA) and a Gated Context Perception Unit. More importantly, without additional memory overhead, SAWA allows the perception range to be adjusted adaptively according to correlation strength, which addresses the misalignment and information loss caused by window truncation and models fusion at variable granularity. Furthermore, emotion refinement through Dynamic Frame Convolution strips out emotion-irrelevant frames, generating a compact and emotionally discriminative fusion representation. The efficacy of MFDR is confirmed on IEMOCAP and CMU-MOSEI, where it demonstrates promising performance.