Speech Emotion Recognition Using Multi-Modal Feature Fusion Network

Feng Li, Jiusong Luo, Wei Liu
DOI: https://doi.org/10.1109/prai59366.2023.10332053
2023-01-01
Abstract: Speech emotion recognition (SER) aims to help machines recognize human emotions from utterances. Emotions in utterances are usually conveyed through both audio and text. However, existing SER studies mostly use a single modality or a single feature, ignoring the integration of features from multiple modalities. In this paper, we propose a multi-modal feature fusion network that uses cross-modal attention to extract emotion-related features from audio and text. More specifically, the audio modality is represented by both wav2vec 2.0 features and Mel spectrograms. First, we combine the text and wav2vec 2.0 features using the cross-modal fusion attention (CMFA) module. Then, the multi-modal fusion features are produced by concatenating the CMFA module’s multi-modal output with the Mel spectrograms passed through MobileNet. Finally, comprehensive experiments on the IEMOCAP dataset show that the proposed method significantly outperforms state-of-the-art approaches.
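The two-stage fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the internals of the CMFA module, so this example substitutes plain scaled dot-product cross-attention with no learned projections, where text tokens act as queries over audio frames. The function names (cross_attention, mean_pool, fuse), the toy feature values, and the fixed mel_embedding vector standing in for a MobileNet output are all assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: every query row attends over all key/value rows.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mean_pool(rows):
    # Average over the sequence axis to get one utterance-level vector.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fuse(text_feats, audio_feats, mel_embedding):
    # Stage 1 (assumed stand-in for CMFA): text queries attend over audio frames.
    attended = cross_attention(text_feats, audio_feats, audio_feats)
    # Stage 2: concatenate the pooled cross-modal vector with the Mel-branch embedding.
    return mean_pool(attended) + mel_embedding

# Toy inputs (hypothetical values): 3 text tokens and 4 audio frames, both of dimension 2.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_feats = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
# Hypothetical stand-in for the MobileNet embedding of a Mel spectrogram.
mel_embedding = [0.1, 0.2, 0.3]

fused = fuse(text_feats, audio_feats, mel_embedding)
```

Here fused has length 5: two dimensions from the pooled cross-modal output followed by the three Mel-branch dimensions. In a real system the two modalities would first be projected to a shared dimension by learned layers, and the fused vector would feed a classifier over emotion labels.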