Cross-modal Features Interaction-and-Aggregation Network with Self-consistency Training for Speech Emotion Recognition

Ying Hu, Huamin Yang, Hao Huang, Liang He
DOI: https://doi.org/10.21437/interspeech.2024-1733
2024-01-01
Abstract: In recent years, much research on speech emotion recognition (SER) has used multimodal data. Selective fusion of the features from different modalities is critical for multimodal SER. In this paper, we propose a cross-modal features interaction-and-aggregation network (CFIA-Net) with self-consistency training for SER. Specifically, we design a cross-modal features interaction-and-aggregation (CFIA) module that lets the features of the audio and text modalities interact adaptively and then integrates them. Moreover, we introduce a self-consistency training strategy, which uses the features from deeper layers to supervise those from shallower ones so that the shallower layers capture information related to the SER task. The experimental results show that, compared with other bimodal SER methods, CFIA-Net achieves state-of-the-art performance on the IEMOCAP dataset, with a weighted accuracy (WA) of 83.37% and an unweighted accuracy (UA) of 83.67%.
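The abstract reports WA and UA without defining them. As a minimal sketch, assuming the convention commonly used in SER work (WA as overall accuracy across all utterances, UA as the mean of per-class recalls), the two metrics can be computed as follows. The function names and the toy labels are hypothetical and are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)   # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)


# Toy labels for illustration only (hypothetical, not IEMOCAP data).
y_true = ["ang", "ang", "ang", "ang", "hap", "hap", "sad", "neu"]
y_pred = ["ang", "ang", "ang", "hap", "hap", "sad", "sad", "neu"]

wa = weighted_accuracy(y_true, y_pred)    # 6 of 8 correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of 3/4, 1/2, 1/1, 1/1
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

Under this convention the two numbers diverge when classes are imbalanced: WA is dominated by the most frequent emotions, while UA weights every class equally.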