Dual Memory Fusion for Multimodal Speech Emotion Recognition

Simon Denman, C. Fookes, Darshana Priyasad, Tharindu Fernando, S. Sridharan
DOI: https://doi.org/10.21437/interspeech.2023-1090
2023-08-20
Abstract: Deep learning has been widely used in multi-modal Speech Emotion Recognition (SER) to learn sentiment-related features by aggregating representations from multiple modes. However, most state-of-the-art methods use attentive fusion or late fusion, which ignores the possibility of long-term dependencies in the data. In this study, we propose a transformer-based SER architecture that fuses modality representations through explicit memory modules, in which information from the current inputs is integrated with historical information, allowing the model to learn the relative importance of the modes over time. We use Wav2Vec2 and BERT models to extract audio and text features, which are then fused by aggregating the features from the individual modes with the information stored in memory, followed by downstream classification. Following state-of-the-art methods, we evaluate the proposed method on the IEMOCAP dataset, and the results indicate that memory-based fusion can achieve substantial improvements.
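The fusion step described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the class name MemoryFusion, the slot count, the averaging query, and the gated write rule are all assumptions made for illustration, and short hand-written vectors stand in for the Wav2Vec2 and BERT features. It only shows the general idea of combining current audio and text features with a history read from an explicit memory.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class MemoryFusion:
    """Hypothetical fixed-size memory with an attention read and a gated write."""

    def __init__(self, slots, dim):
        self.dim = dim
        # Assumption: memory starts empty (all zeros).
        self.memory = [[0.0] * dim for _ in range(slots)]

    def read(self, query):
        # Scaled dot-product attention of the query over the memory slots.
        scores = [dot(slot, query) / math.sqrt(self.dim) for slot in self.memory]
        weights = softmax(scores)
        history = [
            sum(w * slot[i] for w, slot in zip(weights, self.memory))
            for i in range(self.dim)
        ]
        return history, weights

    def write(self, weights, value, rate=0.5):
        # Assumption: each slot moves toward the new value in proportion
        # to its attention weight (a simple gated update).
        for w, slot in zip(weights, self.memory):
            gate = rate * w
            for i in range(self.dim):
                slot[i] = (1.0 - gate) * slot[i] + gate * value[i]

    def fuse(self, audio, text):
        # Assumption: the query is the mean of the two modality vectors.
        query = [(a + t) / 2.0 for a, t in zip(audio, text)]
        history, weights = self.read(query)
        self.write(weights, query)
        # Fused representation: current audio, current text, and history.
        return list(audio) + list(text) + history


fusion = MemoryFusion(slots=4, dim=3)
first = fusion.fuse([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
second = fusion.fuse([0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
```

On the first call the memory is empty, so the history part of the fused vector is all zeros; on the second call it carries a trace of the first utterance. In a real system the fused vector would be passed to a downstream classifier.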
Computer Science