Abstract: Semantic-rich speech emotion recognition is widely used across a range of application areas. Speech emotion recognition (SER) aims to recognize human emotional states from utterances that carry both acoustic and linguistic information. Because both textual and audio patterns play essential roles in SER tasks, many works have proposed modality-fusion methods to exploit text and audio signals effectively. However, most existing models owe their high performance to a large number of learnable parameters, and they work well only on data of fixed length. Reducing computational overhead and improving generalization to unseen data of varying lengths, while maintaining a certain level of recognition accuracy, is therefore a pressing practical problem. In this paper, we propose LGCCT, a light gated and crossed complementation transformer for multimodal speech emotion recognition. First, our model fuses modality information efficiently. Specifically, acoustic features are extracted by a CNN-BiLSTM and textual features by a BiLSTM. A cross-attention module then generates the modality-fused representation, and a gate-control mechanism balances the integration of the original modality representation with the modality-fused representation. Second, we take the degree of attention focus into account: the uncertainty and entropy associated with the same token should converge to the same value, independent of sequence length. To improve generalization to varying test-sequence lengths, we compute the attention score with a length-scaled dot product, which can be interpreted from an entropy-theoretic viewpoint. The length-scaled dot product is cheap yet effective. Experiments are conducted on the benchmark dataset CMU-MOSEI. Compared to the baseline models, our model achieves an 81.0% F1 score with only 0.432 M parameters, improving the balance between performance and parameter count. Moreover, the ablation study demonstrates the effectiveness of our model and its scalability to various input-sequence lengths, with a relative improvement of almost 20% over the baseline without the length-scaled dot product.
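To make the two mechanisms named in the abstract more concrete, the following is a minimal, dependency-free sketch, not the authors' implementation. It assumes that the length-scaled dot product multiplies the usual 1/sqrt(d) scaling by log(n)/log(ref_len), where n is the number of keys and `ref_len` is a hypothetical reference length chosen for this illustration; the exact form used in LGCCT may differ. The gate is likewise assumed to be an element-wise sigmoid, and `gate_logits` stands in for the output of a learned layer that is not shown here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, length_scaled=True, ref_len=512):
    """Attention weights of one query vector over a list of key vectors.

    ref_len is a hypothetical reference length (an assumption of this
    sketch): when len(keys) == ref_len, the length-scaled and the plain
    scaled dot product give identical scores.
    """
    n, d = len(keys), len(query)
    scale = 1.0 / math.sqrt(d)
    if length_scaled:
        # Assumed form of the length scaling: grow the scores with log(n)
        # so that attention does not flatten out on longer sequences.
        scale *= math.log(n) / math.log(ref_len)
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


def entropy(weights):
    """Shannon entropy (in nats) of an attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def gated_fusion(original, fused, gate_logits):
    """Element-wise gate between the original and the fused representation.

    gate_logits is a placeholder for the output of a learned layer
    (hypothetical here); a sigmoid maps each logit to a weight in (0, 1).
    """
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    return [g * f + (1.0 - g) * o for g, f, o in zip(gates, fused, original)]


# Usage: compare attention entropy with and without length scaling.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
plain = attention_weights(query, keys, length_scaled=False)
scaled = attention_weights(query, keys, length_scaled=True, ref_len=4)
print(entropy(plain), entropy(scaled))
```

The only extra cost of the length scaling in this sketch is one scalar multiplication per attention call, which is consistent with the abstract's description of the operation as cheap.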

MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

Emotion Recognition in Videos via Fusing Multimodal Features

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

An autoencoder-based feature level fusion for speech emotion recognition

Multi-head attention fusion networks for multi-modal speech emotion recognition

Graph-based multi-Feature fusion method for speech emotion recognition

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals

Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion

Multimodal emotion recognition from facial expression and speech based on feature fusion

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition

GCF2-Net: global-aware cross-modal feature fusion network for speech emotion recognition

Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition

Multimodal transformer augmented fusion for speech emotion recognition

LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition

Nemoossis, a new genus for the eastern Atlantic long-fin bonefish Pterothrissus belloci Cadenat 1937 and a redescription of P. gissu Hilgendorf 1877 from the northwestern Pacific

Multi-level attention fusion network assisted by relative entropy alignment for multimodal speech emotion recognition