Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition

Dingkang Yang, Shuai Huang, Yang Liu, Lihua Zhang
DOI: https://doi.org/10.1109/lsp.2022.3210836
2022-10-22
IEEE Signal Processing Letters
Abstract: Speech emotion recognition that combines linguistic content and audio signals in dialog is a challenging task. Previous approaches have failed to explore emotion cues in contextual interactions and have ignored the long-range dependencies between elements from different modalities. To tackle these issues, this letter proposes a multimodal speech emotion recognition method using audio and text data. We first present a contextual transformer module that introduces contextual information by embedding the previous utterances between interlocutors, which enhances the emotion representation of the current utterance. Then, the proposed cross-modal transformer module focuses on the interactions between the text and audio modalities, adaptively promoting the fusion from one modality to another. Furthermore, we construct an associative topological relation over each mini-batch and learn the association between deep fused features with a graph convolutional network. Experimental results on the IEMOCAP and MELD datasets show that our method outperforms current state-of-the-art methods.
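The abstract does not give implementation details, so the following is only a minimal sketch of the two fusion ideas it names, not the authors' method. It assumes plain scaled dot-product attention with the learned projections omitted, and it assumes the mini-batch graph is built from clipped cosine similarity with row normalization; the function names `cross_modal_attention` and `batch_graph_conv` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cross_modal_attention(target, source):
    """Let every target-modality step (e.g. a text token) attend over all
    source-modality steps (e.g. audio frames).

    Assumption: queries are the raw target features and keys/values are the
    raw source features; a real model would apply learned projections first.
    """
    d = len(target[0])
    scores = matmul(target, transpose(source))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, source), weights


def batch_graph_conv(features, eps=1e-8):
    """One propagation step over a graph whose nodes are the samples of a
    mini-batch.

    Assumption: edge weights are cosine similarities clipped at zero and
    row-normalized; the learned weight matrix and nonlinearity are omitted.
    """
    norms = [math.sqrt(sum(x * x for x in row)) or eps for row in features]
    unit = [[x / n for x in row] for row, n in zip(features, norms)]
    sim = matmul(unit, transpose(unit))
    adj = [[max(s, 0.0) for s in row] for row in sim]
    adj = [[a / (sum(row) or eps) for a in row] for row in adj]
    return matmul(adj, features)


# Toy usage: 2 text tokens attend over 3 audio frames (feature size 2),
# then 2 fused utterance vectors exchange information within the batch.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_modal_attention(text, audio)
batch_out = batch_graph_conv(fused)
```

Each row of `attn` sums to one, so every fused vector is a weighted average of source-modality features. Swapping the arguments gives the audio-to-text direction.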
engineering, electrical & electronic